
Publications

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

2017

AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
A. Kar, N. Rai, K. Sikka, G. Sharma
Computer Vision and Pattern Recognition (CVPR)
Hawaii, USA, July 2017 (to appear)
PDF       arXiv   Project page  
Abstract: We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient information to discriminate an action class present in a video, from the rest. The proposed method learns to pool such discriminative and informative frames, while discarding a majority of the non-informative frames in a single temporal scan of the video. Our algorithm does so by continuously predicting the discriminative importance of each video frame and subsequently pooling them in a deep learning framework. We show the effectiveness of our proposed pooling method on standard benchmarks where it consistently improves on baseline pooling methods, with both RGB and optical flow based Convolutional networks. Further, in combination with complementary video representations, we show results that are competitive with respect to the state-of-the-art results on two challenging and publicly available benchmark datasets.
@inproceedings{adascan2017,
            title={AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos},
            author={Amlan Kar and Nishant Rai and Karan Sikka and Gaurav Sharma},
            booktitle={CVPR},
            year={2017}
}
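The pooling recurrence the abstract describes — predict a per-frame discriminative importance and fold each frame into a running weighted average in a single temporal scan — can be sketched in a few lines of numpy. The sigmoid linear scorer below is a hypothetical stand-in for the paper's learned importance predictor; dimensions and parameter names are illustrative:

```python
import numpy as np

def adascan_pool(frames, score_w, score_b):
    """Single temporal scan: predict each frame's discriminative importance
    and pool frames as an online weighted average.
    `score_w`/`score_b` stand in for the learned importance predictor."""
    d = frames.shape[1]
    pooled = np.zeros(d)
    total = 0.0
    weights = []
    for f in frames:
        # importance predicted from the current pooled feature and the frame
        z = score_w @ np.concatenate([pooled, f]) + score_b
        alpha = 1.0 / (1.0 + np.exp(-z))  # sigmoid -> importance in (0, 1)
        # online weighted-average update of the pooled descriptor
        pooled = (total * pooled + alpha * f) / (total + alpha + 1e-8)
        total += alpha
        weights.append(alpha)
    return pooled, np.array(weights)

rng = np.random.default_rng(0)
T, d = 10, 16                        # 10 frames, 16-dim frame features
frames = rng.normal(size=(T, d))
score_w = rng.normal(size=2 * d) * 0.1
pooled, w = adascan_pool(frames, score_w, 0.0)
```

Frames with near-zero predicted importance barely move the pooled descriptor, which is how non-informative frames get discarded in one pass.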


An Empirical Evaluation of Visual Question Answering for Novel Objects
S. K. Ramakrishnan, A. Pal, G. Sharma, A. Mittal
Computer Vision and Pattern Recognition (CVPR)
Hawaii, USA, July 2017 (to appear)
PDF       arXiv  
Abstract: We study the problem of answering questions about images in the harder setting where the test questions and corresponding images contain novel objects, which were not queried about in the training data. Such a setting is inevitable in the real world: owing to the heavy-tailed distribution of visual categories, some objects will not be annotated in the training set. We show that the performance of two popular existing methods drops significantly (up to 28%) when evaluated on novel objects compared to known objects. We propose methods which use large existing external corpora of (i) unlabeled text, i.e. books, and (ii) images tagged with classes, to achieve novel object based visual question answering. We conduct systematic empirical studies, both for an oracle case where the novel objects are known textually, and for a fully automatic case without any explicit knowledge of the novel objects, but with the minimal assumption that the novel objects are semantically related to the existing objects in training. The proposed methods for novel object based visual question answering are modular and can potentially be used with many visual question answering architectures. We show consistent improvements with the two popular architectures and give qualitative analysis of the cases where the model does well and of those where it fails to bring improvements.
@inproceedings{novel2017,
            title={An Empirical Evaluation of Visual Question Answering for Novel Objects},
            author={Santhosh K. Ramakrishnan and Ambar Pal and Gaurav Sharma and Anurag Mittal},
            booktitle={CVPR},
            year={2017}
}


Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie, C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(1):87-101, 2017
PDF       arXiv
Abstract: We introduce an Expanded Parts Model (EPM) for recognizing human attributes (e.g. young, short hair, wearing suit) and actions (e.g. running, jumping) in still images. An EPM is a collection of part templates which are learnt discriminatively to explain specific scale-space regions in the images (in human centric coordinates). This is in contrast to current models which consist of a relatively few (i.e. a mixture of) 'average' templates. EPM uses only a subset of the parts to score an image and scores the image sparsely in space, i.e. it ignores redundant and random background in an image. To learn our model, we propose an algorithm which automatically mines parts and learns corresponding discriminative templates together with their respective locations from a large number of candidate parts. We validate our method on three recent challenging datasets of human attributes and actions. We obtain convincing qualitative and state-of-the-art quantitative results on the three datasets.
@article{sharma2016epm,
            title={Expanded Parts Model for Semantic Description of Humans in Still Images},
            author={Gaurav Sharma and Frederic Jurie and Cordelia Schmid},
            journal={TPAMI},
            volume={39},
            number={1},
            pages={87--101},
            year={2017}
}


Large Scale Novel Object Discovery in 3D
S. Srivastava, G. Sharma, B. Lall
arXiv:1701.07046, 2017
PDF       arXiv
Abstract: We present a method for discovering objects in 3D point clouds from sensors like Microsoft Kinect. We utilize supervoxels generated directly from the point cloud data and design a Siamese network building on a recently proposed 3D convolutional neural network architecture. At training, we assume the availability of some known objects---these are used to train a non-linear embedding of supervoxels using the Siamese network, by optimizing the criterion that supervoxels which fall on the same object should be closer, in the embedding space, than those which fall on different objects. We do not assume the test objects to be known, and cluster supervoxels in the learned embedding space to effectively perform novel object discovery. We validate the method with quantitative results showing that it can discover numerous unseen objects while being trained on only a few dense 3D models. We also show convincing qualitative results of object discovery in point cloud data when the test objects, either specific instances or even their categories, were never seen during training.
@article{objDisc2017,
            title={Large Scale Novel Object Discovery in 3D},
            author={Siddharth Srivastava and Gaurav Sharma and Brejesh Lall},
            journal={arXiv:1701.07046},
            year={2017}
}


Fast Localization of Autonomous Vehicles using Discriminative Metric Learning
A. Pensia, G. Sharma, J. McBride, G. Pandey
Conference on Computer and Robot Vision (CRV)
Alberta, Canada, May 2017 (to appear)
PDF (soon)


2016

Deep Fusion of Visual Signatures for Client-Server Facial Analysis
(Best Paper Award Runners Up)
B. Bhattarai, G. Sharma, F. Jurie
Indian Conference on Vision Graphics and Image Processing (ICVGIP)
Guwahati, India, December 2016
PDF       arXiv  
Abstract: Facial analysis is a key technology for enabling human-machine interaction. In this context, we present a client-server framework, where a client transmits the signature of a face to be analyzed to the server, and, in return, the server sends back various information describing the face e.g. is the person male or female, is she/he bald, does he have a mustache, etc. We assume that a client can compute one (or a combination) of visual features; from very simple and efficient features, like Local Binary Patterns, to more complex and computationally heavy, like Fisher Vectors and CNN based, depending on the computing resources available. The challenge addressed in this paper is to design a common universal representation such that a single merged signature is transmitted to the server, whatever the type and number of features computed by the client, while nonetheless ensuring optimal performance. Our solution is based on learning of a common optimal subspace for aligning the different face features and merging them into a universal signature. We have validated the proposed method on the challenging CelebA dataset, on which our method outperforms existing state-of-the-art methods when a rich representation is available at test time, while giving competitive performance when only simple signatures (like LBP) are available at test time due to resource constraints on the client.
@inproceedings{deepfuse2016,
            title={Deep Fusion of Visual Signatures for Client-Server Facial Analysis},
            author={Binod Bhattarai and Gaurav Sharma and Frederic Jurie},
            booktitle={Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)},
            year={2016}
}


Discriminatively Trained Latent Ordinal Model for Video Classification
K. Sikka, G. Sharma
arXiv:1608.02318, 2016
PDF       arXiv   Project page  
Abstract: We study the problem of video classification for facial analysis and human action recognition. We propose a novel weakly supervised learning method that models the video as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for smile, running and jumping for high-jump). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF; it extends such frameworks to model the ordinal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations and on three challenging human action datasets. We also validate the method with qualitative results and show that they largely support the intuitions behind the method.
@article{sikka2016lomo,
            title={Discriminatively Trained Latent Ordinal Model for Video Classification},
            author={Karan Sikka and Gaurav Sharma},
            journal={arXiv:1608.02318},
            year={2016}
}
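The ordinal constraint — sub-event templates must fire in temporal order — makes inference over a video a small dynamic program. Below is a hedged sketch of that inference step only (the per-frame template responses are assumed given; the paper's weakly supervised learning of the templates is not shown):

```python
import numpy as np

def best_ordered_subevents(scores):
    """scores[t, k] = response of sub-event template k at frame t.
    Returns the max total score over assignments placing the K templates
    at strictly increasing frame indices (the ordinal constraint),
    computed by dynamic programming in O(T * K) time."""
    T, K = scores.shape
    dp = np.full((T, K), -np.inf)
    dp[:, 0] = scores[:, 0]              # first template: any frame
    for k in range(1, K):
        best_prev = -np.inf              # best dp[t', k-1] over t' < t
        for t in range(1, T):
            best_prev = max(best_prev, dp[t - 1, k - 1])
            dp[t, k] = best_prev + scores[t, k]
    return float(dp[:, K - 1].max())

# toy example: 4 frames, 2 ordered sub-events
toy = np.array([[1., 0.],
                [0., 5.],
                [3., 0.],
                [0., 2.]])
best = best_ordered_subevents(toy)   # template 0 at frame 0, template 1 at frame 1
```

On the toy matrix the optimum is 1 + 5 = 6: placing template 0 at frame 2 (score 3) would force template 1 to frame 3 (score 2), which is worse.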


CP-mtML: Coupled Projection multi-task Metric Learning for Large Scale Face Retrieval
B. Bhattarai, G. Sharma, F. Jurie
Computer Vision and Pattern Recognition (CVPR)
Las Vegas, NV, USA, June 2016
PDF       arXiv
Abstract: We propose a novel Coupled Projection multi-task Metric Learning (CP-mtML) method for large scale face retrieval. In contrast to previous works which were limited to low dimensional features and small datasets, the proposed method scales to large datasets with high dimensional face descriptors. It utilises pairwise (dis-) similarity constraints as supervision and hence does not require exhaustive class annotation for every training image. While, traditionally, multi-task learning methods have been validated on the same dataset but different tasks, we work in the more challenging setting with heterogeneous datasets and different tasks. We show empirical validation on multiple face image datasets of different facial traits, e.g. identity, age and expression. We use classic Local Binary Pattern (LBP) descriptors along with the recent Deep Convolutional Neural Network (CNN) features. The experiments clearly demonstrate the scalability and improved performance of the proposed method on the tasks of identity and age based face image retrieval compared to competitive existing methods, on the standard datasets and in the presence of a million distractor face images.
@inproceedings{mtml_cvpr_2016,
            title={{CP-mtML}: {C}oupled Projection multi-task Metric Learning for Large Scale Face Retrieval},
            author={Binod Bhattarai and Gaurav Sharma and Frederic Jurie},
            booktitle={CVPR},
            year={2016}
}


LOMo: Latent Ordinal Model for Facial Analysis in Videos
(Spotlight presentation)
K. Sikka, G. Sharma, M. Bartlett
Computer Vision and Pattern Recognition (CVPR)
Las Vegas, NV, USA, June 2016
PDF       arXiv
Abstract: We study the problem of facial analysis in videos. We propose a novel weakly supervised learning method that models the video event (expression, pain etc.) as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for smile, brow lower and cheek raise for pain). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF; it extends such frameworks to model the ordinal or temporal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations. In combination with complementary features, we report state-of-the-art results on these datasets.
@inproceedings{lomo_cvpr_2016,
            title = {{LOMo}: Latent Ordinal Model for Facial Analysis in Videos},
            author = {Karan Sikka and Gaurav Sharma and Marian Bartlett},
            booktitle={CVPR},
            year={2016}
}


Latent Embeddings for Zero-shot Classification
(Spotlight presentation)
Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, B. Schiele
Computer Vision and Pattern Recognition (CVPR)
Las Vegas, NV, USA, June 2016
PDF       arXiv
Abstract: We present a novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification. The proposed method augments the state-of-the-art bilinear compatibility model by incorporating latent variables. Instead of learning a single bilinear map, it learns a collection of maps, with the selection of which map to use being a latent variable for the current image-class pair. We train the model with a ranking based objective function which penalizes incorrect rankings of the true class for a given image. We empirically demonstrate that our model improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting. Moreover, our method leads to visually highly interpretable results with clear clusters of different fine-grained object properties that correspond to different latent variable maps.
@inproceedings{latem_cvpr_2016,
            title = {Latent Embeddings for Zero-shot Classification},
            author = {Yongqin Xian and Zeynep Akata and Gaurav Sharma and Quynh Nguyen and Matthias Hein and Bernt Schiele},
            booktitle={CVPR},
            year={2016}
}
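The latent bilinear compatibility can be sketched directly: score an image embedding against a class embedding with each map and take the max, the chosen map being the latent variable. A toy numpy sketch with random, untrained maps (all dimensions and names are illustrative, not the paper's setup):

```python
import numpy as np

def latem_score(x, y, Ws):
    """Max over bilinear maps x^T W_i y: the choice of map is latent."""
    return max(float(x @ W @ y) for W in Ws)

def predict(x, class_embs, Ws):
    """Zero-shot prediction: pick the class embedding with best compatibility."""
    return int(np.argmax([latem_score(x, y, Ws) for y in class_embs]))

rng = np.random.default_rng(1)
dx, dy, K = 8, 6, 4                     # image dim, class-embedding dim, #maps
Ws = [rng.normal(size=(dx, dy)) for _ in range(K)]
x = rng.normal(size=dx)                 # an image embedding
class_embs = rng.normal(size=(5, dy))   # 5 (e.g. unseen) class embeddings
pred = predict(x, class_embs, Ws)
```

At test time an unseen class needs only its class embedding (e.g. attributes or word vectors), which is what makes the model zero-shot.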


Local Higher-Order Statistics (LHS) Describing Images with Statistics of Local Non-binarized Pixel Patterns
G. Sharma, F. Jurie
Computer Vision and Image Understanding (CVIU), 142:13-22, 2016
PDF       HAL
Abstract: We propose a new image representation for texture categorization and facial analysis, relying on the use of higher-order local differential statistics as features. It has been recently shown that small local pixel pattern distributions can be highly discriminative while being extremely efficient to compute, which is in contrast to the models based on the global structure of images. Motivated by such works, we propose to use higher-order statistics of local non-binarized pixel patterns for the image description. The proposed model does not require either (i) user specified quantization of the space (of pixel patterns) or (ii) any heuristics for discarding low occupancy volumes of the space. We propose to use a data driven soft quantization of the space, with parametric mixture models, combined with higher-order statistics based on Fisher scores. We demonstrate that this leads to a more expressive representation which, when combined with discriminatively learned classifiers and metrics, achieves state-of-the-art performance on challenging texture and facial analysis datasets, in a low-complexity setup. Further, it is complementary to higher-complexity features and improves performance when combined with them.
@article{lhs_cviu_2016,
            title = {Local Higher-Order Statistics ({LHS}) describing images with statistics of local non-binarized pixel patterns},
            author = {G. Sharma and F. Jurie},
            journal = {Computer Vision and Image Understanding (CVIU)},
            volume = {142},
            pages = {13--22},
            year = 2016
}


A Joint Learning Approach for Cross Domain Age Estimation
(Best Student Paper Award in Image, Video and Multidimensional Signal Processing)
B. Bhattarai, G. Sharma, A. Lechervy, F. Jurie
International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Shanghai, China, Mar 2016
PDF      
Abstract: We propose a novel joint learning method for cross domain age estimation, a domain adaptation problem. The proposed method learns a low dimensional projection along with a regressor, in the projection space, in a joint framework. The projection aligns the features from two different domains, i.e. source and target, to the same space, while the regressor predicts the age from the domain aligned features. After this alignment, a regressor trained with only a few examples from the target domain, along with more examples from the source domain, can predict the ages of the target domain face images very well. We provide empirical validation on the largest publicly available dataset for age estimation, i.e. MORPH-II. The proposed method improves performance over several strong baselines and the current state-of-the-art methods.
@inproceedings{jl_icassp_2016,
            title = {A Joint Learning Approach for Cross Domain Age Estimation},
            author = {B. Bhattarai and G. Sharma and A. Lechervy and F. Jurie},
            year={2016},
            booktitle={ICASSP}
}


2015

Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval
G. Sharma, B. Schiele
International Conference on Computer Vision (ICCV)
Santiago, Chile, Dec 2015
PDF      
Abstract: We propose a novel algorithm for the task of supervised discriminative distance learning by nonlinearly embedding vectors into a low dimensional Euclidean space. We work in the challenging setting where supervision is with constraints on similar and dissimilar pairs while training. The proposed method is derived by an approximate kernelization of a linear Mahalanobis-like distance metric learning algorithm and can also be seen as a kernel neural network. The number of model parameters and test time evaluation complexity of the proposed method are O(dD) where D is the dimensionality of the input features and d is the dimension of the projection space - this is in contrast to the usual kernelization methods as, unlike them, the complexity does not scale linearly with the number of training examples. We propose a stochastic gradient based learning algorithm which makes the method scalable (w.r.t. the number of training examples), while being nonlinear. We train the method with up to half a million training pairs of 4096 dimensional CNN features. We give empirical comparisons with relevant baselines on seven challenging datasets for the task of low dimensional semantic category based image retrieval.
@inproceedings{nml_iccv_2016,
            title = {Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval},
            author = {Gaurav Sharma and Bernt Schiele},
            year={2015},
            booktitle={ICCV}
}
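The pairwise supervision described in the abstract can be illustrated with a linear stand-in: a d×D projection trained by SGD with a hinge loss on squared distances of similar/dissimilar pairs. This is a simplified sketch under stated assumptions (the paper's method is a nonlinear, approximately kernelized embedding; the margin, learning rate and exact loss form here are illustrative choices, not the paper's):

```python
import numpy as np

def sgd_pair_metric(pairs, labels, d, lr=0.01, margin=1.0, epochs=20, seed=0):
    """Learn a d x D projection L from (x, z) pairs with labels
    y = +1 (similar) / -1 (dissimilar), via SGD on a hinge loss over the
    squared distance ||L(x - z)||^2: similar pairs are pulled together,
    dissimilar pairs pushed beyond the margin."""
    D = pairs[0][0].shape[0]
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(d, D))
    for _ in range(epochs):
        for (x, z), y in zip(pairs, labels):
            diff = L @ (x - z)
            # hinge is active when y * (margin - dist^2) < 1
            if y * (margin - diff @ diff) < 1.0:
                L -= lr * 2.0 * y * np.outer(diff, x - z)
    return L

# toy data: two well-separated clusters in 10-D, projected to 2-D
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 10)) * 0.1          # cluster A
B = rng.normal(size=(10, 10)) * 0.1 + 5.0    # cluster B, shifted away
pairs = [(A[i], A[i + 1]) for i in range(5)] + [(A[i], B[i]) for i in range(5)]
labels = [1] * 5 + [-1] * 5
L = sgd_pair_metric(pairs, labels, d=2)
```

Note the O(dD) parameter count the abstract highlights: the model size depends on the input and projection dimensions only, not on the number of training pairs.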


Latent Max-margin Metric Learning for Comparing Video Face Tubes
(Best paper award)
G. Sharma, P. Perez
Workshop on Biometrics
Computer Vision and Pattern Recognition (CVPR)

Boston, MA, USA, June 2015
PDF      
Abstract: Comparing "face tubes" is a key component of modern systems for face biometrics based video analysis and annotation. We present a novel algorithm to learn a distance metric between such spatio-temporal face tubes in videos. The main novelty in the algorithm is based on incorporation of latent variables in a max-margin metric learning framework. The latent formulation allows us to model, and learn metrics to compare faces under different challenging variations in pose, expressions and lighting. We propose a novel dataset named TV Series Face Tubes (TSFT) for evaluating the task. The dataset is collected from 12 different episodes of 8 popular TV series and has 94 subjects with 569 manually annotated face tracks in total. We show quantitatively how incorporating latent variables in max-margin metric learning leads to improvement of current state-of-the-art metric learning methods for the two cases when the testing is done with subjects that were seen during training and when the test subjects were not seen at all during training. We also give results on a challenging benchmark dataset: YouTube faces, and place our algorithm in context w.r.t. existing methods.
@inproceedings{lftc_cvprw15,
            title = {Latent Max-margin Metric Learning for Comparing Video Face Tubes},
            author = {Gaurav Sharma and Patrick Perez},
            year={2015},
            booktitle={CVPRW}
}


2014

EPML: Expanded Parts based Metric Learning for Occlusion Robust Face Verification
G. Sharma, F. Jurie, P. Perez
Asian Conference on Computer Vision (ACCV)
Singapore, Nov 2014
PDF      
Abstract: We propose a novel Expanded Part-based Metric Learning (EPML) model for face verification. The model is capable of mining out the discriminative regions at the right locations and scales, for identity based matching of face images. It performs well in the presence of occlusions, by avoiding the occluded regions and selecting the next best visible regions. We show quantitatively, by experiments on the standard benchmark dataset Labeled Faces in the Wild (LFW), that the model works much better than the traditional method of face representation with metric learning, both (i) in the presence of heavy random occlusions and, (ii) also, in the case of focussed occlusions of discriminative face regions such as eyes or mouth. Further, we present qualitative results which demonstrate that the method is capable of ignoring the occluded regions while exploiting the visible ones.
@inproceedings{epml_acccv14,
            title = { {EPML}: {E}xpanded Parts based Metric Learning for Occlusion Robust Face Verification},
            author = {Gaurav Sharma and Frederic Jurie and Patrick Perez},
            year={2014},
            booktitle={ACCV}
}


Learning Nonlinear SVM in Input Space for Image Classification
G. Sharma, F. Jurie, P. Perez
HAL Technical report, hal-00977304
Rennes, France, 2014
PDF       HAL  
Abstract: The kernel trick enables learning of non-linear decision functions without having to explicitly map the original data to a high dimensional space. However, at test time, it requires evaluating the kernel with each one of the support vectors, which is time consuming. We propose a novel approach for learning non-linear support vector machines (SVM) corresponding to commonly used kernels in computer vision, namely (i) Histogram Intersection, (ii) chi-squared, (iii) Radial Basis Function (RBF) and (iv) RBF with chi-squared distance, without using the kernel trick. The proposed classifier incorporates non-linearity while maintaining O(D) testing complexity (for D-dimensional space), compared to O(D × Nsv) (for Nsv support vectors) when using the kernel trick. We also promote the idea that such an efficient non-linear classifier, combined with simple image encodings, is a promising direction for image classification. We validate the proposed method with experiments on four challenging image classification datasets. It achieves similar performance w.r.t. kernel SVM and a recent explicit feature mapping method while being significantly faster and more memory efficient. It obtains competitive performance while being an order of magnitude faster than the state-of-the-art Fisher Vector method and, when combined with it, consistently improves performance at a very small additional computation cost.
@article{nsvm_hal14,
            title = {Learning Non-linear SVM in Input Space for Image Classification},
            author = {Gaurav Sharma and Frederic Jurie and Patrick Perez},
            year={2014},
            journal={HAL Technical Report hal-00977304}
}


Some Faces are More Equal than Others: Hierarchical Organization for Accurate and Efficient Large-scale Identity-based Face Retrieval
B. Bhattarai, G. Sharma, F. Jurie, P. Perez
European Conference on Computer Vision (ECCV) Workshops
Zurich, Switzerland, Sep 2014
PDF      
Abstract: This paper presents a novel method for hierarchically organizing large face databases, with application to efficient identity-based face retrieval. The method relies on metric learning with local binary pattern (LBP) features. On the one hand, LBP features have proved to be highly resilient to various appearance changes due to illumination and contrast variations while being extremely efficient to calculate. On the other hand, metric learning (ML) approaches have proved very successful for face verification 'in the wild', i.e. in uncontrolled face images with large amounts of variations in pose, expression, appearances, lighting, etc. While such ML based approaches compress high dimensional features into low dimensional spaces using discriminatively learned projections, the complexity of retrieval is still significant for large scale databases (with millions of faces). The present paper shows that learning such discriminative projections locally while organizing the database hierarchically leads to a more accurate and efficient system. The proposed method is validated on the standard Labeled Faces in the Wild (LFW) benchmark dataset with millions of additional distracting face images collected from photos on the internet.
@inproceedings{hfaces_eccvw14,
            title = {Some faces are more equal than others:
                        {H}ierarchical organization for accurate and efficient large-scale identity-based face retrieval},
            author = {Binod Bhattarai and Gaurav Sharma and Frederic Jurie and Patrick Perez},
            year={2014},
            booktitle={ECCVW}
}


Transfer Learning via Attributes for Improved On-the-fly Classification
P. Kulkarni, G. Sharma, J. Zepeda, L. Chevallier
IEEE Winter Conference on Applications of Computer Vision (WACV)
Colorado, USA, Mar 2014
PDF       HAL  
Abstract: Retrieving images for an arbitrary user query, provided in textual form, is a challenging problem. A recently proposed method addresses this by constructing a visual classifier with images returned by an internet image search engine, based on the user query, as positive images while using a fixed pool of negative images. However, in practice, not all the images obtained from internet image search are always pertinent to the query; some might contain abstract or artistic representation of the content and some might have artifacts. Such images degrade the performance of the on-the-fly constructed classifier.
        We propose a method for improving the performance of on-the-fly classifiers by using transfer learning via attributes. We first map the textual query to a set of known attributes and then use those attributes to prune the set of images downloaded from the internet. This pruning step can be seen as zero-shot learning of the visual classifier for the textual user query, which transfers knowledge from the attribute domain to the query domain. We also use the attributes along with the on-the-fly classifier to score the database images and obtain a hybrid ranking. We show interesting qualitative results and demonstrate by experiments with standard datasets that the proposed method improves upon the baseline on-the-fly classification system.
@inproceedings{tl_wacv14,
            title = {Transfer Learning via Attributes for Improved On-the-fly Classification},
            author = {Praveen Kulkarni and Gaurav Sharma and Joaquin Zepeda and Louis Chevallier},
            year={2014},
            booktitle={WACV}
}


2013

A Novel Approach for Efficient SVM Classification with Histogram Intersection Kernel
(Oral presentation; 7% acceptance rate)
G. Sharma, F. Jurie
British Machine Vision Conference (BMVC)
Bristol, UK, Sep 2013
PDF       Presentation Video  
Abstract: The kernel trick – commonly used in machine learning and computer vision – enables learning of non-linear decision functions without having to explicitly map the original data to a high dimensional space. However, at test time, it requires evaluating the kernel with each one of the support vectors, which is time consuming. In this paper, we propose a novel approach for learning non-linear SVM corresponding to the histogram intersection kernel without using the kernel trick. We formulate the exact non-linear problem in the original space and show how to perform classification directly in this space. The learnt classifier incorporates non-linearity while maintaining O(d) testing complexity (for d-dimensional input space), compared to O(d × Nsv) when using the kernel trick. We show that the SVM problem with histogram intersection kernel is quasi-convex in input space and outline an iterative algorithm to solve it. The proposed approach has been validated in experiments where it is compared with other linear SVM-based methods, showing that the proposed method achieves similar or better performance at lower computational and memory costs.
@inproceedings{nsvm_bmvc13,
            title = {A Novel Approach for Efficient SVM Classification with Histogram Intersection Kernel},
            author = {Gaurav Sharma and Frederic Jurie},
            year={2013},
            booktitle={BMVC}
}
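The O(d) vs O(d × Nsv) trade-off is easy to make concrete. The sketch below shows the standard observation that a histogram-intersection decision function decomposes per dimension, so with per-dimension sorted support-vector values and prefix sums it can be evaluated in O(d log Nsv) instead of O(d × Nsv). This illustrates the complexity argument only; it is not the paper's algorithm, which learns the classifier directly in input space:

```python
import numpy as np

def decision_naive(x, svs, alphas, b):
    """O(d * Nsv): sum_j alpha_j * K_HI(x, sv_j) + b, evaluating the
    histogram intersection kernel against every support vector."""
    return float(sum(a * np.minimum(x, sv).sum() for a, sv in zip(alphas, svs)) + b)

def precompute(svs, alphas):
    """Per-dimension sorted SV values plus prefix sums of alpha*value
    and alpha, enabling O(log Nsv) evaluation per dimension."""
    order = np.argsort(svs, axis=0)
    s = np.take_along_axis(svs, order, axis=0)   # sorted values per dim
    a = alphas[order]                            # matching alphas per dim
    return s, np.cumsum(a * s, axis=0), np.cumsum(a, axis=0)

def decision_fast(x, pre, b):
    """Per dimension: SVs with value <= x_d contribute alpha*value
    (prefix sum); SVs above contribute alpha * x_d (suffix of alpha)."""
    s, csum_as, csum_a = pre
    total = b
    for d in range(len(x)):
        k = int(np.searchsorted(s[:, d], x[d], side='right'))
        below = csum_as[k - 1, d] if k > 0 else 0.0
        a_above = csum_a[-1, d] - (csum_a[k - 1, d] if k > 0 else 0.0)
        total += below + a_above * x[d]
    return float(total)

rng = np.random.default_rng(3)
Nsv, d = 20, 5
svs = rng.random((Nsv, d))        # support vectors (histogram-like)
alphas = rng.normal(size=Nsv)     # signed dual coefficients
x = rng.random(d)
f_naive = decision_naive(x, svs, alphas, b=0.3)
f_fast = decision_fast(x, precompute(svs, alphas), b=0.3)
```

Both evaluations agree exactly; only their costs differ, which is the motivation for avoiding the kernel trick at test time.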


Expanded Parts Model for Human Attribute and Action Recognition in Still Images
G. Sharma, F. Jurie, C. Schmid
Computer Vision and Pattern Recognition (CVPR)
Oregon, USA, June 2013
PDF      
Abstract: We propose a new model for recognizing human attributes (e.g. wearing a suit, sitting, short hair) and actions (e.g. running, riding a horse) in still images. The proposed model relies on a collection of part templates which are learnt discriminatively to explain specific scale-space locations in the images (in human centric coordinates). It avoids the limitations of highly structured models, which consist of a few (i.e. a mixture of) ‘average’ templates. To learn our model, we propose an algorithm which automatically mines out parts and learns corresponding discriminative templates with their respective locations from a large number of candidate parts. We validate the method on recent challenging datasets: (i) Willow 7 actions [7], (ii) 27 Human Attributes (HAT) [25], and (iii) Stanford 40 actions [37]. We obtain convincing qualitative and state-of-the-art quantitative results on the three datasets.
@inproceedings{epm_cvpr12,
            title = {Expanded Parts Model for Human Attribute and Action Recognition in Still Images},
            author = {Gaurav Sharma and Frederic Jurie and Cordelia Schmid},
            year={2013},
            booktitle={CVPR}
}


2012

Semantic Description of Humans in Images
G. Sharma
PhD thesis, GREYC - Université de Caen and LEAR - INRIA Grenoble
Caen, France, December 2012
PDF      
Abstract: In the present thesis we are interested in semantic description of humans in images. We propose to describe humans with the help of (i) semantic attributes e.g. male or female, wearing a tee-shirt, (ii) actions e.g. riding a horse, running and (iii) facial expressions e.g. smiling, angry.

First, we propose a new image representation to better exploit the class specific spatial information. The standard representation, i.e. spatial pyramids, has two shortcomings. It assumes that the distribution of spatial information (i) is uniform and (ii) is the same for all tasks. We address these shortcomings by learning the discriminative spatial information for a specific task. Further, we propose a model that adapts the spatial information for each image for a given task. This lends more flexibility to the model and allows for misalignments of discriminative regions, e.g. the legs may be at different positions in different images for the running class. Finally, we propose a new descriptor for facial expression analysis. We work in the space of intensity differences of local pixel neighborhoods and propose to learn the quantization of the space and use higher order statistics of the difference vector to obtain more expressive descriptors.

We introduce a challenging dataset of human attributes containing 9344 human images, sourced from the internet, with annotations for 27 semantic attributes based on sex, pose, age and appearance/clothing. We validate the proposed methods on our dataset of human attributes as well as on publicly available datasets of human actions, fine grained classification involving human actions and facial expressions. We also report results on related computer vision datasets, for scene recognition, object image classification and texture categorization, to highlight the generality of our contributions.
@phdthesis{sharma_thesis2012,
            title = {Semantic Description of Humans in Images},
            author = {Gaurav Sharma},
            year={2012},
            school = {LEAR -- INRIA, GREYC -- CNRS}
}


Local Higher-Order Statistics (LHS) for Texture Categorization and Facial Analysis
G. Sharma, S. ul Hussain, F. Jurie
European Conference on Computer Vision (ECCV)
Florence, Italy, October 2012
PDF      
Abstract: This paper proposes a new image representation for texture categorization and facial analysis, relying on the use of higher-order local differential statistics as features. In contrast with models based on the global structure of textures and faces, it has been shown recently that small local pixel pattern distributions can be highly discriminative. Motivated by such works, the proposed model employs higher-order statistics of local non-binarized pixel patterns for the image description. Hence, in addition to being remarkably simple, it requires neither any user specified quantization of the space (of pixel patterns) nor any heuristics for discarding low occupancy volumes of the space. This leads to a more expressive representation which, when combined with a discriminative SVM classifier, consistently achieves state-of-the-art performance on challenging texture and facial analysis datasets, outperforming contemporary methods (with similar powerful classifiers).
@inproceedings{sharma_eccv2012,
            title = {Local Higher-Order Statistics ({LHS}) for Texture Categorization and Facial Analysis},
            author = {Gaurav Sharma and Sibt ul Hussain and Frederic Jurie},
            year={2012},
            booktitle = {ECCV}
}


Discriminative Spatial Saliency for Image Classification
G. Sharma, F. Jurie, C. Schmid
Computer Vision and Pattern Recognition (CVPR)
Rhode Island, USA, June 2012
PDF      
Abstract: In many visual classification tasks the spatial distribution of discriminative information is (i) non uniform e.g. person reading can be distinguished from taking a photo based on the area around the arms i.e. ignoring the legs and (ii) has intra class variations e.g. different readers may hold the books differently. Motivated by these observations, we propose to learn the discriminative spatial saliency of images while simultaneously learning a max margin classifier for a given visual classification task. Using the saliency maps to weight the corresponding visual features improves the discriminative power of the image representation. We treat the saliency maps as latent variables and allow them to adapt to the image content to maximize the classification score, while regularizing the change in the saliency maps. Our experimental results on three challenging datasets, for (i) human action classification, (ii) fine grained classification and (iii) scene classification, demonstrate the effectiveness and wide applicability of the method.
@inproceedings{sharma_cvpr2012,
            title = {Discriminative Spatial Saliency for Image Classification},
            author = {Gaurav Sharma and Frederic Jurie and Cordelia Schmid},
            year={2012},
            booktitle = {CVPR}
}


2011

Learning Discriminative Spatial Representation for Image Classification
(Oral presentation; 8% acceptance rate)
G. Sharma, F. Jurie
British Machine Vision Conference (BMVC)
Dundee, UK, Sep 2011
PDF       Dataset of Human Attributes (HATDB)   Presentation Video  
Abstract: Spatial Pyramid Representation (SPR) [7] introduces spatial layout information to the orderless bag-of-features (BoF) representation. SPR has become the standard and has been shown to perform competitively against more complex methods for incorporating spatial layout. In SPR the image is divided into regular grids. However, the grids are taken as uniform spatial partitions without any theoretical motivation. In this paper, we address this issue and propose to learn the spatial partitioning with the BoF representation. We define a space of grids where each grid is obtained by a series of recursive axis aligned splits of cells. We cast the classification problem in a maximum margin formulation with the optimization being over the weight vector and the spatial grid. In addition to experiments on two challenging public datasets (Scene-15 and Pascal VOC 2007) showing that the learnt grids consistently perform better than the SPR while being much smaller in vector length, we also introduce a new dataset of human attributes and show that the current method is well suited to the recognition of spatially localized human attributes.
@inproceedings{dsr_bmvc2011,
            title = {Learning Discriminative Spatial Representation for Image Classification},
            author = {Gaurav Sharma and Frederic Jurie},
            year={2011},
            booktitle = {BMVC}
}


2010

Distributed Calibration of Pan-Tilt Camera Network using Multi-Layered Belief Propagation
A. Choudhary, G. Sharma, S. Chaudhury, S. Banerjee
Workshop on Camera Networks
Computer Vision and Pattern Recognition (CVPR)

San Francisco, California, USA, June 2010
PDF  
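No BibTeX is given on the page for this entry; a sketch in the style of the entries above, with an assumed citation key and authors kept as the initials listed, could be:

```bibtex
@inproceedings{camnet_cvprw2010,
            title = {Distributed Calibration of Pan-Tilt Camera Network using Multi-Layered Belief Propagation},
            author = {A. Choudhary and G. Sharma and S. Chaudhury and S. Banerjee},
            year = {2010},
            booktitle = {CVPR Workshop on Camera Networks}
}
```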


2009

Adaptive Digital Makeup
A. Dhall, G. Sharma, R. Bhatt, G. M. Khan
5th International Symposium on Visual Computing (ISVC)
Las Vegas, Nevada, USA, November 2009
PDF  
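In the same style as the entries above, a possible BibTeX record (citation key assumed) is:

```bibtex
@inproceedings{makeup_isvc2009,
            title = {Adaptive Digital Makeup},
            author = {A. Dhall and G. Sharma and R. Bhatt and G. M. Khan},
            year = {2009},
            booktitle = {ISVC}
}
```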


Hierarchical System for Categorization and Orientation Detection of Consumer Images
G. Sharma, A. Dhall, S. Chaudhury, R. Bhatt
International Conference on Pattern Recognition and Machine Intelligence (PReMI)
Delhi, India, December 2009
PDF  
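A BibTeX sketch matching the page's convention (key assumed) might be:

```bibtex
@inproceedings{hierarchical_premi2009,
            title = {Hierarchical System for Categorization and Orientation Detection of Consumer Images},
            author = {G. Sharma and A. Dhall and S. Chaudhury and R. Bhatt},
            year = {2009},
            booktitle = {PReMI}
}
```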


Curvature Feature Distribution based Classification of Indian Scripts from Document Images
G. Sharma, R. Garg, S. Chaudhury
Workshop on Multilingual OCR
International Conference on Document Analysis and Recognition (ICDAR)

Barcelona, Spain, July 2009
PDF  
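Following the pattern of the entries above, a possible BibTeX record (key and booktitle phrasing assumed) is:

```bibtex
@inproceedings{scripts_icdarw2009,
            title = {Curvature Feature Distribution based Classification of Indian Scripts from Document Images},
            author = {G. Sharma and R. Garg and S. Chaudhury},
            year = {2009},
            booktitle = {ICDAR Workshop on Multilingual OCR}
}
```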


Object Detection as Statistical Test of Hypothesis
G. Sharma, S. Chaudhury, J. B. Srivastava
Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)
Orissa, India, December 2009
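A BibTeX record in the style of the entries above (key assumed) could be:

```bibtex
@inproceedings{objdet_icvgip2009,
            title = {Object Detection as Statistical Test of Hypothesis},
            author = {G. Sharma and S. Chaudhury and J. B. Srivastava},
            year = {2009},
            booktitle = {ICVGIP}
}
```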


2008

Kernel Eigen Space Merging
G. Sharma, S. Chaudhury, J. B. Srivastava
Technical report for Master's thesis, Indian Institute of Technology Delhi (IITD)
Delhi, India, July 2008
PDF  
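Since this is a report rather than a conference paper, a sketch in the page's style would use `@techreport` (key and `type` field assumed):

```bibtex
@techreport{kesm_iitd2008,
            title = {Kernel Eigen Space Merging},
            author = {G. Sharma and S. Chaudhury and J. B. Srivastava},
            year = {2008},
            institution = {Indian Institute of Technology Delhi},
            type = {Technical report for Master's thesis}
}
```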


Bag-of-Features Kernel Eigen Spaces for Classification
G. Sharma, S. Chaudhury, J. B. Srivastava
International Conference on Pattern Recognition (ICPR)
Tampa, Florida, USA, December 2008
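In the style of the entries above, a possible BibTeX record (key assumed) is:

```bibtex
@inproceedings{bof_icpr2008,
            title = {Bag-of-Features Kernel Eigen Spaces for Classification},
            author = {G. Sharma and S. Chaudhury and J. B. Srivastava},
            year = {2008},
            booktitle = {ICPR}
}
```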


2007

Text Mining through Entity-Relationship Based Information Extraction
L. Dey, M. Abulaish, Jahiruddin, G. Sharma
Workshop on Bio-Medical Application of Web Technology
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Silicon Valley, California, USA, 2007
PDF
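A BibTeX sketch following the page's convention (key and booktitle phrasing assumed; the mononymous author braced so BibTeX treats it as a full name) could be:

```bibtex
@inproceedings{textmining_wi2007,
            title = {Text Mining through Entity-Relationship Based Information Extraction},
            author = {L. Dey and M. Abulaish and {Jahiruddin} and G. Sharma},
            year = {2007},
            booktitle = {WI-IAT Workshop on Bio-Medical Application of Web Technology}
}
```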