Publications

Learning manifolds from non-stationary streams

Published in Springer Nature Journal of Big Data, 2024

Streaming adaptations of manifold learning based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model, that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-art streaming Isomap algorithms, and the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower dimensional representation of high dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered due to real-world factors, such as introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.

Mahapatra, Suchismit, and Varun Chandola. "Learning manifolds from non-stationary streams." Journal of Big Data 11, no. 1 (2024): 42. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-023-00872-8

New Methods & Metrics for LFQA tasks

Published in arXiv preprint, 2021

Long-form question answering (LFQA) tasks require retrieving the documents pertinent to a query, using them to form a paragraphlength answer. Despite considerable progress in LFQA modeling, fundamental issues impede its progress: i) train/validation/test dataset overlap, ii) absence of automatic metrics and iii) generated answers not being “grounded” in retrieved documents. This work addresses every one these critical bottlenecks, contributing natural language inference/generation (NLI/NLG) methods and metrics that make significant strides to their alleviation.

Mahapatra, Suchismit, Vladimir Blagojevic, Pablo Bertorello, and Prasanna Kumar. "New Methods & Metrics for LFQA tasks." arXiv preprint arXiv:2112.13432 (2021).

Interpretable Graph Similarity Computation via Differentiable Optimal Alignment of Node Embeddings

Published in SIGIR 2021, the 44th International Conference on Research and Development in Information Retrieval, 2021

Computing graph similarity is an important task in many graphrelated applications such as retrieval in graph databases or graph clustering. While numerous measures have been proposed to capture the similarity between a pair of graphs, Graph Edit Distance (GED) and Maximum Common Subgraphs (MCS) are the two widely used measures in practice. GED and MCS are domain-agnostic measures of structural similarity between the graphs and define the similarity as a function of pairwise alignment of different entities (such as nodes, edges, and subgraphs) in the two graphs. The explicit explainability offered by the pairwise alignment provides transparency and justification of the similarity score, thus, GED and MCS have important practical applications. However, their exact computations are known to be NP-hard. While recently proposed neural-network based approximations have been shown to accurately compute these similarity scores, they have limited ability in providing comprehensive explanations compared to classical combinatorial algorithms, e.g., Beam search. This paper aims at efficiently approximating these domain-agnostic similarity measures through a neural network, and simultaneously learning the alignments (i.e., explanations) similar to those of classical intractable methods. Specifically, we formulate the similarity between a pair of graphs as the minimal “transformation” cost from one graph to another in the learnable node-embedding space. We show that, if node embedding is able to capture its neighborhood context closely, our proposed similarity function closely approximates both the alignment and the similarity score of classical methods. Furthermore, we also propose an efficient differentiable computation of our proposed objective for model training. Empirically, we demonstrate that the proposed method achieves up to 50%-100% reduction in the Mean Squared Error for the graph similarity approximation task and up to 20% improvement in the retrieval evaluation metrics for the graph retrieval task.

Doan, Khoa D., Saurav Manchanda, Suchismit Mahapatra, and Chandan K. Reddy. "Interpretable graph similarity computation via differentiable optimal alignment of node embeddings." In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665-674. 2021.

Discretized Bottleneck: Posterior-Collapse-Free Sequence-to-Sequence Learning

Published in arXiv preprint, 2020

Variational autoencoders (VAEs) are important tools in end-to-end representation learning. VAEs can capture complex data distributions and have been applied extensively in many NLP tasks. However, a common pitfall in sequence-to-sequence learning with VAEs is the posterior-collapse issue in latent space, wherein the model tends to ignore latent variables when a strong auto-regressive decoder is implemented. In this paper, we propose a principled approach to eliminate this issue by applying a discretized bottleneck in the latent space. Specifically, we impose a shared discrete latent space where each input is learned to choose a combination of shared latent atoms as its latent representation. Compared with VAEs employing continuous latent variables, our model endows more promising capability in modeling underlying semantics of discrete sequences and can thus provide more interpretative latent structures. Empirically, we demonstrate the efficiency and effectiveness of our model on a broad range of tasks, including language modeling, unaligned text style transfer, dialog response generation, and neural machine translation.

Zhao, Yang, Ping Yu, Suchismit Mahapatra, Qinliang Su, and Changyou Chen. "Discretized Bottleneck: Posterior-Collapse-Free Sequence-to-Sequence Learning." (2020). https://arxiv.org/abs/2004.10603

S-Isomap++: Multi Manifold Learning from Streaming Data

Published in 2017 IEEE International Conference on Big Data (Big Data), 2018

Manifold learning based methods have been widely used for non-linear dimensionality reduction (NLDR). However, in many practical settings, the need to process streaming data is a challenge for such methods, owing to the high computational complexity involved. Moreover, most methods operate under the assumption that the input data is sampled from a single manifold, embedded in a high dimensional space. We propose a method for streaming NLDR when the observed data is either sampled from multiple manifolds or irregularly sampled from a single manifold. We show that existing NLDR methods, such as Isomap, fail in such situations, primarily because they rely on smoothness and continuity of the underlying manifold, which is violated in the scenarios explored in this paper. However, the proposed algorithm is able to learn effectively in presence of multiple, and potentially intersecting, manifolds, while allowing for the input data to arrive as a massive stream.

Mahapatra, Suchismit, and Varun Chandola. "S-Isomap++: Multi manifold learning from streaming data." In 2017 IEEE International Conference on Big Data (Big Data), pp. 716-725. IEEE, 2017. http://ieeexplore.ieee.org/document/8257987/

Error Metrics for Learning Reliable Manifolds from Streaming Data

Published in Proceedings of the 2017 SIAM International Conference on Data Mining, 2017

Spectral dimensionality reduction is frequently used to identify low-dimensional structure in high-dimensional data. However, learning manifolds, especially from the streaming data, is computationally and memory expensive. In this paper, we argue that a stable manifold can be learned using only a fraction of the stream, and the remaining stream can be mapped to the manifold in a significantly less costly manner. Identifying the transition point at which the manifold is stable is the key step. We present error metrics that allow us to identify the transition point for a given stream by quantitatively assessing the quality of a manifold learned using Isomap. We further propose an efficient mapping algorithm, called S-Isomap, that can be used to map new samples onto the stable manifold. We describe experiments on a variety of data sets that show that the proposed approach is computationally efficient without sacrificing accuracy.

Schoeneman, Frank, Suchismit Mahapatra, Varun Chandola, Nils Napp, and Jaroslaw Zola. "Error metrics for learning reliable manifolds from streaming data." In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 750-758. Society for Industrial and Applied Mathematics, 2017. https://epubs.siam.org/doi/10.1137/1.9781611974973.84

Modeling Graphs Using a Mixture of Kronecker Models

Published in Proceedings of the 3rd IEEE International Conference on Big Data, 2015

Generative models for graphs are increasingly becoming a popular tool for researchers to generate realistic approximations of graphs. While in the past, focus was on generating graphs which follow general laws, such as the power law for degree distribution, current models have the ability to learn from observed graphs and generate synthetic approximations. The primary emphasis of existing models has been to closely match different properties of a single observed graph. Such models, though stochastic, tend to generate samples which do not have significant variance in terms of the various graph properties. We argue that in many cases real graphs are sampled drawn from a graph population (e.g., networks sampled at various time points, social networks for individual schools, healthcare networks for different geographic regions, etc.). Such populations typically exhibit significant variance. However, existing models are not designed to model this variance, which could lead to issues such as overfitting. We propose a graph generative model that focuses on matching the properties of real graphs and the natural variance expected for the corresponding population. The proposed model adopts a mixture-model strategy to expand the expressiveness of Kronecker product based graph models (KPGM), while building upon the two strengths of KPGM, viz., ability to model several key properties of graphs and to scale to massive graph sizes using its elegant fractal growth based formulation. The proposed model, called x-Kronecker Product Graph Model, or xKPGM, allows scalable learning from observed graphs and generates samples that match the mean and variance of several salient graph properties. We experimentally demonstrate the capability of the proposed model to capture the inherent variability in real world graphs on a variety of publicly available graph data sets.

Mahapatra, Suchismit, and Varun Chandola. "Modeling graphs using a mixture of Kronecker models." In 2015 IEEE international conference on big data (big data), pp. 727-736. IEEE, 2015. http://ieeexplore.ieee.org/document/7363817/