ASU/CSE Awards

Top 5% Faculty Award, Fulton Schools of Engineering, 2014

Best Teacher Award, Fulton Schools of Engineering, 2013

Top 5% Faculty Award, Fulton Schools of Engineering, 2012

Distinguished Service in Computer Science and Engineering Award, 2009

Researcher of the Year Award, 2008

Service Faculty of the Year Award, 2007

RanKloud (NSF)

Today, data is produced in massive quantities. The applications driving this influx span a large spectrum, from entertainment and surveillance to e-commerce, the web, and social media. This data flood brings forth a need for highly parallelizable frameworks for scalable processing and efficient analysis of large media collections. In practice, given an atomic task, one can partition the data and distribute the work across multiple machines in such a way that all machines do similar work on different data.

The observation that a significant class of data processing applications can be described in terms of a small set of primitives that are easy to describe and, in many cases, easy to parallelize, has led to frameworks (such as the MapReduce-based Hadoop, Dynamo, Scope, and PNUTS) and languages (such as Pig Latin) that target the needs of non-transactional, data-intensive applications. These systems achieve high degrees of scalability by carefully allocating resources to available processing elements and leveraging any opportunities to parallelize basic processing tasks. In this work, we note that when dealing with analysis workflows, we need to consider data and feature utility when partitioning the data and the work across the available servers. In most applications (such as multimedia analysis), the utility of a data object or a given feature to a particular search or analysis task varies with many factors, including the way the data is collected (e.g., its precision), its relevance to the user's goals, and how discriminating the object or its features are. Thus, we argue that under these conditions, achieving better scalability requires leveraging utility assessments of the data elements and their features to intelligently route the qualifying data to the appropriate processing units. Relying on this observation, we propose RanKloud, a scalable ranked media query processing system: RanKloud avoids waste by intelligently partitioning the data and allocating it on the available resources in a way that minimizes data replication and indexing overheads and prunes superfluous low-utility processing.

  • RanKloud parallelizes ranked processing operations (such as top-k joins, skylines, and nearest neighbor joins) by building on the MapReduce paradigm. Implementing these operations over a system that partitions the data and processes the partitions in batches, however, requires care: naive partitioning of the data onto servers often wastes resources, because (a) it increases the data replication cost, (b) the cost of building and accessing index structures becomes unnecessarily high, and (c) the system spends time and resources producing a large number of (local) candidates that are eventually eliminated during the final result integration stage due to their lower (global) utilities. Thus, RanKloud research includes (a) adaptable, rank-aware data processing (map, reduce, and merge) primitives; (b) waste- and imbalance-avoidance strategies for utility-aware data partitioning, resource allocation, and incremental batched processing; and (c) strategies for adapting the media processing workflow schedule based on the data and utility characteristics discovered at run time.

    • K. Selcuk Candan. RanKloud: Scalable Multimedia and Social Media Retrieval and Analysis in the Cloud. Invited talk at the Symposium on "Next Generation Multimedia Research and Development", NYU Abu Dhabi Institute, May 2-3, 2012.
    • K. Selcuk Candan, Jong Wook Kim, Parth Nagarkar, Mithila Nagendra, and Renwei Yu. RanKloud: Scalable Multimedia Data Processing in Server Clusters. IEEE Multimedia, Special Issue on Large-Scale Multimedia Retrieval and Mining, 18(1), pp. 64-77, 2011.
    • K. Selcuk Candan, Parth Nagarkar, Mithila Nagendra, and Renwei Yu. RanKloud: A Scalable Ranked Query Processing Framework on Hadoop, demonstration at the 14th International Conference on Extending Database Technology (EDBT), March 22-24, 2011.
    • K. Selcuk Candan. RanKloud: Scalable Multimedia and Social Media Retrieval and Analysis in the Cloud. Keynote talk abstract, Proceedings of the International Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR'11), 2011.
    • Renwei Yu, Mithila Nagendra, Parth Nagarkar, K. Selcuk Candan, and Jong Wook Kim. Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services. In "New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud", Series: Lecture Notes in Business Information Processing, Vol. 74, pp. 155-184, 2011.
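The partition-then-merge pruning idea above can be sketched in a few lines. This is a hypothetical illustration, not RanKloud's actual implementation: each partition keeps only its local top-k candidates under an assumed scalar utility function, and a merge step combines the local candidates into the global top-k.

```python
import heapq

def local_top_k(partition, k, utility):
    # Map side: emit only the k highest-utility objects of one partition;
    # everything else is pruned locally and never shipped to the merge step.
    return heapq.nlargest(k, partition, key=utility)

def merge_top_k(local_results, k, utility):
    # Merge side: combine the per-partition candidates into the global top-k.
    merged = [obj for part in local_results for obj in part]
    return heapq.nlargest(k, merged, key=utility)

# Example: objects scored by a single (hypothetical) 'score' utility field.
partitions = [[{"id": 1, "score": 0.9}, {"id": 2, "score": 0.1}],
              [{"id": 3, "score": 0.7}, {"id": 4, "score": 0.8}]]
score = lambda obj: obj["score"]
candidates = [local_top_k(p, 2, score) for p in partitions]
top2 = merge_top_k(candidates, 2, score)
print([obj["id"] for obj in top2])  # the two highest-utility objects: [1, 4]
```

The sketch shows why naive partitioning is wasteful: with a lower per-partition k or a utility threshold shared across partitions, far fewer low-utility local candidates would reach the merge step.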

  • Tensors (multi-dimensional arrays) are widely used for representing high-dimensional, multi-faceted data, such as social media data. Consequently, a system dealing with social media data needs to scale with the tensor volume and the number and diversity of the data facets. This necessitates highly parallelizable, and in many cases cloud-based, frameworks for scalable processing and efficient analysis of large media and social media collections. RanKloud also addresses the computational cost of various multi-dimensional data analysis operations, including decompositions and structural change detection, by (a) leveraging a priori background knowledge (or metadata) about one or more domain dimensions and (b) extending compressed sensing (CS) to tensor data to encode observed tensor streams in the form of compact descriptors.
    • As part of RanKloud research, we developed SCENT, an innovative, scalable spectral analysis framework for internet scale monitoring of multirelational social media data, encoded in the form of tensor streams. In SCENT, we focused on the computational cost of structural change detection in tensor streams and extended compressed sensing (CS) to tensor data. We showed that, through the use of randomized tensor ensembles, SCENT is able to encode the observed tensor streams in the form of compact descriptors. We showed that the descriptors allow very fast detection of significant spectral changes in the tensor stream, which also reduce data collection, storage, and processing costs.
      • Yu-Ru Lin, K. Selcuk Candan, Hari Sundaram, Lexing Xie. SCENT: Scalable Compressed Monitoring of Evolving Multi-Relational Social Networks. ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP), special issue on "Social Media", 7S, 1, Article 29, November 2011.
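A minimal sketch of the compressed-descriptor idea behind SCENT, under simplifying assumptions (tensors flattened to vectors, a Gaussian random projection as the encoder, and a change flagged when the distance between consecutive descriptors exceeds a threshold); all names and parameters are illustrative, not SCENT's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(n, m):
    # Gaussian random projection: compresses a length-n (flattened) tensor
    # into a length-m descriptor, with m much smaller than n.
    P = rng.standard_normal((m, n)) / np.sqrt(m)
    return lambda tensor: P @ tensor.ravel()

def changed(prev_desc, curr_desc, threshold):
    # Flag a significant structural change between consecutive descriptors.
    return np.linalg.norm(curr_desc - prev_desc) > threshold

encode = make_encoder(n=4 * 4 * 4, m=16)   # 64-entry tensor -> 16-entry descriptor
t0 = np.zeros((4, 4, 4))
t1 = t0.copy()        # no change between t0 and t1
t2 = t0 + 1.0         # a large, global change at t2
d0, d1, d2 = encode(t0), encode(t1), encode(t2)
print(changed(d0, d1, 0.5), changed(d1, d2, 0.5))  # False True
```

Only the small descriptors need to be stored and compared per time step, which is the source of the reduced collection, storage, and processing costs noted above.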

    • In traditional matrix and tensor analysis operations, the only basis for the task is a given matrix or tensor describing the strengths of the relationships between pairs of elements in the different domains. In many real-life applications, however, other background knowledge or metadata about one or more of the input domain dimensions may be available and, if leveraged properly, such metadata can play a significant role in the effectiveness of the co-clustering process. How additional metadata affects the analysis, however, depends on how the process is modified to be context-aware. We proposed, compared, and evaluated different strategies for embedding available contextual knowledge into the analysis process. Experimental results showed that it is possible to leverage the available metadata to improve the quality of the analysis without significant overheads in execution cost. Alternatively, the metadata can help reduce the execution time of the analysis without significant impact on result quality.
      • Claudio Schifanella, Maria Luisa Sapino, and K. Selcuk Candan. On Context-Aware Co-Clustering with Metadata Support. Journal of Intelligent Information Systems, 38(1): 209-239, 2012.
      • Claudio Schifanella, K. Selcuk Candan, and Maria Luisa Sapino. Metadata-Driven Multiresolution Approach to Tensor Decomposition. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pp. 1275-1284, 2011.

    • For many multi-dimensional data applications, tensor operations as well as relational operations need to be supported throughout the data lifecycle. Since high-dimensional tensor decomposition is expensive, we proposed algorithms that first partition the data into smaller tensors, for example based on the functional dependencies of the relation, and then decompose these smaller tensors. These algorithms map naturally onto multiple cores, leading to highly efficient, effective, and parallelized algorithms for both dense and sparse tensors. We are also developing TensorDB, an in-database analytic system for efficient in-database tensor decompositions on chunk-based array data stores. TensorDB includes static and dynamic in-database tensor decomposition operators. It extends an array database and leverages array operations for data manipulation and integration, and it supports complex data processing plans in which multiple relational-algebraic and tensor-algebraic operations are composed with each other. The in-database version of TensorDB has been released as open-source software on GitHub.
      • Claudio Schifanella, K. Selçuk Candan, and Maria Luisa Sapino. 2014. Multiresolution Tensor Decompositions with Mode Hierarchies. ACM Trans. Knowl. Discov. Data 8, 2, Article 10 (June 2014), 38 pages. DOI=10.1145/2532169
      • Mijung Kim. TensorDB and Tensor-Relational Model (TRM) for Efficient Tensor-Relational Operations. (2014). PhD Thesis. Arizona State University.
      • Mijung Kim and K. Selcuk Candan (2014). Pushing-Down Tensor Decompositions over Unions to Promote Reuse of Materialized Decompositions. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD).
      • Mijung Kim and K. Selcuk Candan. SBV-Cut: Vertex-Cut based Graph Partitioning using Structural Balance Vertices. Data & Knowledge Engineering, 72: 285-303, 2012.
      • Mijung Kim and K. Selcuk Candan. Decomposition-by-normalization (DBN): leveraging approximate functional dependencies for efficient tensor decomposition. CIKM 2012: 355-364
      • Mijung Kim and K. Selcuk Candan. Approximate Data Analysis within a Tensor-Relational Algebraic Framework. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pp. 1737-1742, 2011.
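The partition-then-decompose strategy can be illustrated with a sketch (assumed for illustration, not TensorDB's actual code): split a matrix into independent blocks and compute each block's rank-1 approximation separately, as smaller decompositions that could run on separate cores.

```python
import numpy as np

def rank1_approx(block):
    # Best rank-1 approximation of one partition via truncated SVD.
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

rng = np.random.default_rng(1)
data = rng.random((6, 8))
blocks = np.hsplit(data, 2)   # two independent column partitions
# Each block can be decomposed on its own core; results are stitched back.
approx = np.hstack([rank1_approx(b) for b in blocks])
print(approx.shape)  # (6, 8)
```

Decomposing two 6x4 blocks is cheaper than decomposing one 6x8 matrix at higher rank, and the per-block work is embarrassingly parallel; the papers above extend this idea to tensors partitioned along approximate functional dependencies.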

    • Efficient processing of preference/skyline queries is a key challenge in RanKloud. Most existing static and streaming techniques assume that the skyline query is applied to a single data source. Unfortunately, this is not true in many applications in which, due to the complexity of the schema, the skyline query may involve attributes belonging to multiple data sources. Recently, in the context of static environments, various hybrid skyline-join algorithms have been proposed. However, these algorithms suffer from several drawbacks: they often need to scan the data sources exhaustively in order to obtain the set of skyline-join results; moreover, the pruning techniques employed to eliminate the tuples are largely based on expensive pairwise tuple-to-tuple comparisons. On the other hand, most existing streaming methods focus on single stream skyline analysis, thus rendering these techniques unsuitable for applications that require a real-time “join” operation to be carried out before the skyline query can be answered. Based on these observations, we are developing the SkySuite framework of skyline-join operators that can be leveraged to efficiently process skyline-join queries over both static and stream environments. Among others, SkySuite includes (1) a novel Skyline-Sensitive Join (SSJ) operator that effectively processes skyline-join queries in static environments, and (2) a Layered Skyline-window-Join (LSJ) operator that incrementally maintains skyline-join results over stream environments.
      • Mithila Nagendra and K. Selcuk Candan. Layered Processing of Skyline-Window-Join (SWJ) Queries using Iteration-Fabric. ICDE'13, pp. 985-996, 2013.
      • Mithila Nagendra and K. Selcuk Candan. SkySuite: A Framework of Skyline Join Operators for Static and Stream Environments. Demonstration at VLDB'13, Proceedings of the VLDB Endowment (PVLDB). 6 (12), 2013.
      • Mithila Nagendra and K. Selcuk Candan. Skyline-Sensitive Joins with LR-Pruning. 15th International Conference on Extending Database Technology (EDBT), March, 2012.
      • Giacomo Cappellari. Parallel Iteration Fabric: Efficient Parallel Skyline-Window-Join Computation. Master's Thesis, Politecnico di Torino, Italy, 2013.
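For reference, a naive skyline computation over a single static data set looks as follows; the exhaustive pairwise tuple-to-tuple comparisons in this sketch are exactly the cost that the SSJ and LSJ operators are designed to avoid (illustrative code with larger-is-better semantics assumed, not the actual SkySuite operators):

```python
def dominates(a, b):
    # a dominates b if a is at least as good on every attribute
    # and strictly better on at least one (larger is better here).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    # Exhaustive pairwise comparisons: quadratic in the number of tuples.
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

pts = [(5, 1), (3, 4), (4, 4), (1, 5), (2, 2)]
print(skyline(pts))  # [(5, 1), (4, 4), (1, 5)]
```

In the skyline-join setting the input to this computation is itself the result of a join, which is why interleaving pruning with the join (as SSJ and LSJ do) beats running the join to completion first.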

    • When data are large and query processing workloads consist of data selection and aggregation operations (as in online analytical processing), column-oriented data stores are generally the preferred choice of data organization, because they enable effective data compression, leading to significantly reduced IO. Most column-store architectures leverage bitmap indices, which can themselves be compressed, for answering queries over data columns. Many column domains (e.g., geographical data, categorical data, biological taxonomies, organizational data) are hierarchical in nature, and it may be more advantageous to create hierarchical bitmap indices that can help answer queries over different sub-ranges of the domain. However, given a query workload, it is critical to choose the appropriate subset of bitmap indices from the given hierarchy. Thus, we introduce, and propose efficient solutions to, the hierarchical cut-selection (HCS) problem, which aims to identify a subset (cut) of the nodes of the domain hierarchy for which bitmap indices should be maintained. We develop inclusive, exclusive, and hybrid strategies for cut selection and show that the hybrid strategy can be computed efficiently and returns optimal (in terms of IO) results when there are no memory constraints. We also show that when there is a memory availability constraint, the cut-selection problem becomes difficult; we thus present efficient cut-selection strategies that return close-to-optimal results, especially when the memory limitations are very strict (i.e., the data and the hierarchy are much larger than the available memory). Experimental results confirm the efficiency and effectiveness of the proposed cut-selection algorithms and the HCS system developed based on these principles.
      • Parth Nagarkar and K. Selçuk Candan. HCS: Hierarchical Cut Selection for Efficiently Processing Queries on Data Columns using Hierarchical Bitmap Indices. Proc. 17th International Conference on Extending Database Technology (EDBT). Athens, Greece, 2014.
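A toy illustration of the underlying bitmap-hierarchy idea (assumed for illustration, not the HCS system itself): one bitmap per leaf value, with a hierarchy node's bitmap obtained by OR-ing its children's, so that materializing the node answers queries over its whole sub-range with a single bitmap.

```python
# Six example rows of a (hypothetical) 'state' column.
rows = ["AZ", "CA", "AZ", "NY", "CA", "AZ"]

def bitmap(value):
    # Bit i of the result is set iff row i holds the given value.
    bits = 0
    for i, v in enumerate(rows):
        if v == value:
            bits |= 1 << i
    return bits

# A hierarchy node ("West") materialized as the OR of its children's bitmaps:
# a query over the whole "West" sub-range now needs a single bitmap scan.
west = bitmap("AZ") | bitmap("CA")
print(bin(west))  # 0b110111 -> rows 0, 1, 2, 4, 5 fall in the West sub-range
```

The cut-selection trade-off is visible even here: materializing "West" saves an OR at query time but costs extra memory, and choosing which nodes to materialize under a memory budget is the HCS problem.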

Related grants:

NSF III: Small: RanKloud: Data Partitioning and Resource Allocation Strategies for Scalable Multimedia and Social Media Analysis, NSF Grant #116394.

HP Labs Innovation Research Program (IRP): “Data-Quality Aware Middleware for Scalable Data Analysis”, 2009-2010.

Short bio

K. Selcuk Candan is a Professor of Computer Science and Engineering at the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University, where he leads the EmitLab research group. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park.

Prof. Candan's primary research interest is in the area of management of non-traditional, heterogeneous, and imprecise (such as multimedia, web, and scientific) data. His research projects in this domain are funded by diverse sources, including the National Science Foundation, Department of Defense, Mellon Foundation, and DES/RSA (Rehabilitation Services Administration). He has published over 140 articles and many book chapters, and holds 9 patents. Recently, he co-authored the book "Data Management for Multimedia Retrieval", published by Cambridge University Press, and co-edited "New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud" for Springer.

Prof. Candan has served as an editorial board member of the Very Large Data Bases (VLDB) Journal, one of the most respected database journals. He is currently an associate editor for the IEEE Transactions on Multimedia and the Journal of Multimedia. He has served on the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD'06, the flagship database conference of the ACM and one of the best conferences in the area of management of data. In 2008, he served as a PC Chair for another leading, flagship conference of the ACM, this time focusing on multimedia research (MM'08). More recently, he served as a program committee group leader for ACM SIGMOD'10. He has also served on the review board of the Proceedings of the VLDB Endowment (PVLDB). In 2011, he served on the Executive Committee of ACM SIGMM.

In 2010, he was a program co-chair for the ACM CIVR'10 conference. In 2011, he served as a general co-chair for the ACM MM'11 conference. In 2012, he served as a general co-chair for ACM SIGMOD'12. In 2015, he will serve as a general co-chair for the IEEE International Conference on Cloud Engineering (IC2E'15).

He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist.

For his curriculum vitae, please click here.