Third Annual SRL/ISSDM Research Symposium

UCSC Systems Oktoberfest

October 18-19, 2011

Baskin School of Engineering

University of California, Santa Cruz



Keynote Speaker:  Richard Golding, Senior Software Architect, Kinsey Technical Services

Biography:  Dr. Richard Golding is a Senior System Architect with KTSi, leading the software architecture for the DARPA System F6 program at NASA Ames.   He received his PhD in Computer Science from UC Santa Cruz, and did a postdoc at the Vrije Universiteit Amsterdam.  He has worked on self-managed systems, distributed systems, and adaptive real-time systems at HP Labs, Panasas, and IBM Almaden before joining the DARPA/NASA team.

Keynote Address:  “Security and Real-time in Open Distributed Systems -- a Research Challenge Agenda”

Mission- and life-critical systems that have traditionally been implemented as isolated systems are now being deployed in open, networked environments, from the smart grid to medical systems to defense systems. Deploying these systems in a new environment, where many of the old design assumptions no longer hold, creates both a need and an opportunity for research into new ways to design them. In this talk I will lay out some of the key problems that need solutions.

Session I: Scientific Data Management – Part 1 (Chair: Carlos Maltzahn)

“SciHadoop: Array-based Query Processing in Hadoop” – Joe Buck (PhD Student, UCSC)

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network. (Slides)

 “Push-based Processing of Scientific Data” – Noah Watkins (PhD Student, UCSC)

Large-scale scientific data is collected through experiment and produced by simulation. This data in turn is commonly interrogated using ad-hoc analysis queries and visualized with differing interactivity requirements. At extreme scale this data can be too large to store in multiple copies, or may be easily accessible for only a short period of time. In either case, multiple consumers must be able to interact with the data. Unfortunately, as the number of concurrent users accessing storage media increases, throughput can decrease significantly. This performance degradation is due to the induced random access pattern that results from uncoordinated I/O streams. One common approach to this problem is collective I/O; unfortunately, this is difficult to do for many independent computations. We are investigating a data-centric, push-based approach inspired by work within the database community that has achieved an order-of-magnitude increase in throughput for concurrent query processing. A push-based approach to query processing uses a single data stream originating from storage media rather than allowing multiple requests to compete, and exploits work- and data-sharing opportunities exposed through query semantics. Many challenges exist in this work, notably supporting a distributed execution environment, providing a mix of access performance requirements (throughput vs. latency), and supporting multiple data models, including relational and array-based. (slides)

“Insertion Optimized File System” – Latchesar Ionkov (Researcher, LANL and PhD Student, UCSC)

Gostor is an experimental platform for testing new file storage ideas for post-POSIX usage. Gostor provides greater flexibility for manipulating the data within a file, including inserting and deleting data anywhere in the file, creating and removing holes in the data, etc. Each modification of the data creates a new file. Gostor doesn't implement any way of organizing the files into hierarchical structures, or of mapping them to strings. Thus Gostor can be used to implement standard file systems as well as to experiment with new ways of storing and accessing users' data.

Session II: Scientific Data Management – Part 2 (Chair: Neoklis Polyzotis)

“FLAMBES: Evolving Fast Performance Models” – Adam Crume (PhD Student, UCSC)

Large clusters and supercomputers are simulated to aid in design. Many devices, such as hard drives, are slow to simulate. Our approach is to use a genetic algorithm to fit parameters for an analytical model of a device. Fitting focuses on aggregate accuracy rather than request-level accuracy since individual request times are irrelevant in large simulations. The model is fitted to traces from a physical device or a known device-accurate model. This is done once, offline, before running the simulation. Execution of the model is fast, since it only requires a modest amount of floating point math and no event queuing. Only a few floating-point numbers are needed for state. Compared to an event-driven model, this trades a little accuracy for a large gain in performance. (Slides)
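The fitting loop described above can be sketched as a small genetic algorithm. The analytical model below (a fixed per-request overhead plus a per-byte transfer cost) and all parameter ranges are hypothetical stand-ins, not the model used in FLAMBES; the point is that fitness is measured on aggregate time over the whole trace, not per-request error.

```python
import random

# Hypothetical analytical device model: fixed overhead + per-byte transfer
# cost. FLAMBES' actual model is richer; this is only a sketch.
def model_time(params, req_size):
    overhead, per_byte = params
    return overhead + per_byte * req_size

def aggregate_error(params, trace):
    # Fitness targets aggregate accuracy: total predicted time vs. total
    # observed time over the whole trace, not per-request error.
    predicted = sum(model_time(params, size) for size, _ in trace)
    observed = sum(t for _, t in trace)
    return abs(predicted - observed)

def fit(trace, generations=200, pop_size=30, seed=1):
    rng = random.Random(seed)
    # Initial population drawn from arbitrary (made-up) parameter ranges.
    pop = [(rng.uniform(0.0, 20.0), rng.uniform(0.0, 1e-3))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: aggregate_error(p, trace))
        survivors = pop[:pop_size // 2]            # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Blend crossover plus small Gaussian mutation.
            child = tuple((x + y) / 2.0 + rng.gauss(0.0, 0.01 * (abs(x) + 1e-9))
                          for x, y in zip(a, b))
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda p: aggregate_error(p, trace))
```

Once fitted, evaluating `model_time` is just a little floating-point math with no event queue, which is what makes the model cheap inside a large simulation.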

“Divergent Physical Design Tuning” – Jeff LeFevre (PhD Student, UCSC) and Kleoni Ioannidou (Post-Doc, UCSC)

We introduce a new method for tuning the physical design of replicated databases. Our method installs a different (divergent) index configuration to each replica, thus specializing replicas for different subsets of the database workload. We analyze the space of divergent designs and show that there is a tension between the specialization of each replica and the ability to load-balance the database workload across different replicas. Based on our analysis, we develop an algorithm to compute good divergent designs that can balance this trade-off. Experimental results demonstrate the efficacy of our approach.

Session III: Storage Systems (Chair: Scott Brandt)

“QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language” Sasha Ames (PhD Student, UCSC)

File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard with metadata abstractions that were designed at a time when high-end file systems managed less than 100MB. Today's high-performance file systems store 7 to 9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate; directory hierarchies, for example, cannot capture complex relationships among data.

Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems management problems.

To address this problem, we propose QMDS: a file system metadata management service that integrates all file system metadata and uses a graph data model with attributes on nodes and edges. Our service uses a query language interface for file identification and attribute retrieval. We present our metadata management service design and architecture and study its performance using a text analysis benchmark application. Results from our QMDS prototype show the effectiveness of this approach: compared to the use of a file system and relational database, the QMDS prototype shows superior performance for both ingest and query workloads.

“RAID4S: Supercharging RAID Small Writes with SSD” – Rosie Wacha (PhD Student, UCSC)

Parity-based RAID techniques improve data reliability and availability, but at a significant performance cost, especially for small writes. Flash-based solid state drives (SSDs) provide faster random I/O and use less power than hard drives, but are too expensive to substitute for all of the drives in most large-scale storage systems. We present RAID4S, a cost-effective, high-performance technique for improving RAID small-write performance using SSDs for parity storage in a disk-based RAID array. Our results show that a 4HDD+1SSD RAID4S array achieves throughputs 3.3X better than a similar 4+1 RAID4 array and 1.75X better than a 4+1 RAID5 array on small-write-intensive workloads. RAID4S has no performance penalty on disk workloads consisting of up to 90% reads and its benefits are enhanced by the effects of file systems and caches. (Poster)
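The small-write penalty targeted here comes from parity maintenance: updating a single data block in a parity-based array takes four I/Os (read old data, read old parity, write new data, write new parity), because the new parity is derived from all three. A minimal sketch of that XOR update, with made-up block contents for illustration:

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    # Bytewise XOR of two equal-sized blocks.
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data: bytes, new_data: bytes,
                       old_parity: bytes) -> bytes:
    # Read-modify-write: new_parity = old_parity XOR old_data XOR new_data.
    # In RAID4S both parity I/Os (read old, write new) land on the SSD,
    # leaving only the two data I/Os on spinning disks.
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

def stripe_parity(blocks) -> bytes:
    # Full-stripe parity for comparison: XOR of all data blocks.
    return reduce(xor_blocks, blocks)
```

The small-write path touches only the modified block and the parity block, so it stays correct without re-reading the rest of the stripe.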

 “Managing High-Bandwidth Real-Time Data Storage” – David Bigelow (PhD Student, UCSC)

In an information-driven world, the ability to capture and store data in real time is of the utmost importance. The scope and intent of such data capture, however, varies widely. Individuals record television programs for later viewing, governments maintain vast sensor networks to warn against calamity, scientists conduct experiments requiring immense data collection, and automated monitoring tools supervise a host of processes which human hands rarely touch. All such tasks have the same basic requirement -- guaranteed capture of streaming real-time data -- but with greatly differing parameters. Our ability to process and interpret data has grown faster than our ability to store and manage it, leading to the curious condition of being able to recognize the importance of data without being able to store it, and hence being unable to profit from it later. Traditional storage mechanisms are not well suited to managing this type of data, and we have developed a large-scale ring buffer storage architecture to handle it. Our system is well suited to both large and small data elements, has a native indexing mechanism, and can maintain reliability in the face of hardware failure. Strong performance guarantees can be made and kept, and quality-of-service requirements maintained. (Slides)

Session IV: Distributed Storage Performance Management (Chair: Scott Brandt)

“Performance Management for Disk Storage Devices as a Black Box” Dimitrios Skourtis (PhD Student, UCSC)

Given a storage device with multiple disks and a collection of request streams, we study the problem of managing the performance of those streams, without having control over each disk separately, e.g., network-attached storage. Instead, we manage the device as a whole by taking advantage of the order-of-magnitude execution time differences between sequential and random requests, which helps us estimate the disk execution time of each stream. (Slides)

“RUN: Optimal Multiprocessor Real-Time Scheduling via Reduction to Uniprocessor” Greg Levin (PhD Student, UCSC)

Existing optimal multiprocessor real-time schedulers incur significant overhead for preemptions and migrations. We present RUN, a new approach to optimal scheduling which reduces the multiprocessor problem to a series of uniprocessor problems. In our simulations, RUN's average number of preemptions per job never exceeded 3, and it has a provable upper bound of 4 preemptions per job for most task sets. It also reduces to Partitioned EDF whenever a proper partitioning is found, and significantly outperforms existing optimal algorithms. (slides)

“Rad-Flows: Buffering for Predictable Communications” Kleoni Ioannidou (Post-doctoral fellow, UCSC)

Real-time systems and applications are becoming increasingly complex and often comprise multiple communicating tasks. The management of the individual tasks is well understood, but the interaction of communicating tasks with different timing characteristics is less well understood. We discuss several representative inter-task communication flows via reserved memory buffers (possibly interconnected via a real-time network) and present RAD-Flows, a model for managing these interactions. We provide proofs and simulation results demonstrating the correctness and effectiveness of RAD-Flows, allowing system designers to determine the amount of memory required based upon the characteristics of the interacting tasks and to guarantee real-time operation of the system as a whole.


Keynote Speaker: Gary Grider, Deputy Division Leader, Los Alamos National Laboratory

Biography:    Gary currently is the Deputy Division Leader of the High Performance Computing (HPC) Division at Los Alamos National Laboratory, where he is responsible for managing the personnel and processes required to stand up and operate major supercomputing systems, networks, and storage systems for the Laboratory for both the DOE/NNSA Advanced Simulation and Computing (ASC) program and LANL institutional HPC environments. One of his main tasks is conducting and sponsoring R&D to keep the new technology pipeline full and provide solutions to problems in the Lab’s HPC environment. Gary is also the LANL lead in coordinating DOE/NNSA alliances with universities in the HPC I/O and file systems area. He is one of the principal leaders of a small group of multi-agency HPC I/O experts that guide the government in its I/O related computer science R&D investments through the High End Computing Interagency Working Group HECIWG, and is the Director of the Los Alamos/UCSC Institute for Scientific Scalable Data Management and the Los Alamos/CMU Institute for Reliable High Performance Information Technology. He is also the LANL P.I. for the Petascale Data Storage Institute, a SciDAC2 Institute award-winning project. Before working for Los Alamos, Gary spent 10 years with IBM at Los Alamos working on advanced product development and test and 5 years with Sandia National Laboratories working on HPC storage systems.

Gary holds a B.S. in Electrical Engineering, along with a registration as a certified engineer, from Oklahoma State University and the State of Oklahoma. He also received an M.B.A. with emphasis in Management Information Systems, Statistics, Physics, and Mathematics from Oklahoma State University. By far the bulk of Gary's knowledge comes from daily hands-on research, design, prototyping, development, and testing of new systems and hardware, and through mentoring of people and projects within the high-performance storage and network areas.

Keynote Address:  “Los Alamos National Laboratory, a Unique, Irreplaceable, National Resource in the Department of Energy”

The talk will provide an unclassified overview of the Los Alamos National Laboratory, its people, programs, and capabilities. The talk touches on much of the diverse science going on at the laboratory in areas such as materials, biology, cosmology, energy, and climate. A deeper look at information science, computer science, and high-performance computing is also provided.


(Organizer: Yi Zhang)

KM Session I: WWW (Chair: Yi Zhang)

“Mining Intent from Search Results” Jan Pedersen (Chief Scientist for Core Search)

The current generation of Web search services derives most of their quality signals from two sources: the Web Graph and query session logs. For example, PageRank is mined from the Web Graph, while query-understanding features, such as spelling correction, rely heavily on query log analysis. The next generation of Web search services will be distinguished more by presentation than by conventional matching and ranking. Sophisticated presentations require understanding the intent behind a query. For example, knowing that the query [Rancho San Antonio] names a particular place (an open space preserve in the Bay Area), not the class of ranches near San Antonio. Interestingly, Web results typically contain enough information to infer search intents in many cases. I will outline how this can be used through post-result processing to produce both improved results and improved presentations.

“Viral Genomics and the Semantic Web” Carla Kuiken (Researcher, Los Alamos National Laboratory)

The “HIV database and analysis platform” has been maintained in Los Alamos for 22 years, and has grown to be an internationally renowned resource for HIV data analysis. It is in the process of expanding to include hepatitis C virus and hemorrhagic fever viruses; the eventual goal is to make it a universal viral resource. This expansion necessitates much greater reliance on external data and information sources. These resources rarely use the same identifiers and frequently contain annotator- and submitter-specific language. While efforts have been underway for some time to standardize and cross-link biological information on the web, there still is a long way to go. I will describe the current status of the “Viral data analysis platform”, the (semantic) problems we have grappled with, and the local and global efforts at amelioration.

“Patterns of Spam in Twitter” Aleksander Kolcz (Software Engineer, Twitter)

The growing popularity of Twitter has been attracting significant attention from spammers. The 140 character constraint, as well as other characteristics of the Twitter service, affect both legitimate users and spammers alike, forcing spammers to adopt certain unique tactics. In this talk we will offer a glimpse of the various techniques employed by service abusers, contrast them with other types of spam and discuss the challenges they pose to automatic detection systems.

Biography: Aleksander Kolcz is a Software Engineer at Twitter focusing on applying Machine Learning and Data Mining to modeling user interests and preventing service abuse. He has 12 years of R&D experience at Microsoft, AOL, and Personalogy. He received his PhD in 1996 from the University of Manchester Institute of Science and Technology.

KM Session II: Computational Advertising (Chair: James Shanahan)

“Highly Dimensional Problems in Computational Advertising” Andrei Broder (Yahoo! Fellow and Vice President, Computational Advertising)

The central problem of Computational Advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user interacting with a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. The number of contexts is unbounded. Thus, depending on the definition of "best match", this problem leads to a variety of massive optimization and search problems, with complicated constraints. The solution to these problems provides the scientific and technical underpinnings of the online advertising industry, an industry estimated to surpass 28 billion dollars in the US alone in 2011.

An essential aspect of this problem is predicting the impact of an ad on users’ behavior, whether immediate and easily quantifiable (e.g. clicking on ad or buying a product on line) or delayed and harder to measure (e.g. off-line buying or changes in brand perception). To this end, the three components of the problem -- users, contexts, and ads -- are represented as high dimensional objects and terabytes of data documenting the interactions among them are collected every day. Nevertheless, considering the representation difficulty, the dimensionality of the problem and the rarity of the events of interest, the prediction problem remains a huge challenge.

The goal of this talk is twofold: to present a short introduction to Computational Advertising and to survey several high-dimensional problems at the core of this emerging scientific discipline.

Biography: Andrei Broder is a Yahoo! Fellow and Vice President for Computational Advertising. Previously he was an IBM Distinguished Engineer and the CTO of the Institute for Search and Text Analysis at IBM Research. From 1999 until 2002 he was Vice President for Research and Chief Scientist at the AltaVista Company. He graduated summa cum laude from the Technion, Israel Institute of Technology, and obtained his M.Sc. and Ph.D. in Computer Science at Stanford University under Don Knuth. His current research interests are centered on computational advertising, web search, context-driven information supply, and randomized algorithms. He has authored more than a hundred papers and has been awarded over thirty patents. He is a member of the US National Academy of Engineering, a fellow of the ACM and of the IEEE, and past chair of the IEEE Technical Committee on Mathematical Foundations of Computing.

“Machine Learning on Big Data For Personalized Internet Advertising” Michael Recce (Vice President, Quantcast.com)

Marketers have long sought more effective ways to reach their audience: to show the right ad to the right person at the right time. Huge volumes of internet activity data, advances in machine learning methods, new hardware and software for large-scale distributed computing, and developments in real-time decisioning have finally made this possible. Increasingly, the particular advertisement seen on a web page is decided in an auction that takes place in a fraction of a second, while the page is loading. In this presentation I will discuss how we, at Quantcast, meet the challenges of personalizing advertising.

This process involves multiple machine learning methods to evaluate about 15 billion individual daily media events, and leverages this data to make precise bids in almost 100,000 auctions every second.

Biography: Dr. Michael Recce has been managing the Modeling team at Quantcast for the past year and a half. Prior to Quantcast, he led Fortent's transaction monitoring and risk assessment systems. For seven years, Michael worked extensively with financial institutions devising improved methods for detecting unusual activity in financial transaction data. Early in his career, Michael was a product engineering manager at Intel Corporation, where he led the development of new memory products for the company. Other projects he has worked on include the design of a control system for a space-based robot for Daimler-Benz, which was developed to run scientific and engineering experiments in the space station. Michael holds six patents, including one for research on a behavioral biometric called dynamic grip recognition, and was a recipient of the Thomas A. Edison Award in 2005. He has been a lecturer at University College London and a professor of information systems at the New Jersey Institute of Technology. He received his bachelor's degree from the University of California, Santa Cruz and his doctorate from University College London.

“Privacy and Effectiveness in Internet Advertising” Qi Zhao (PhD Student, UCSC)

With the proliferation of diverse Internet services and applications, individuals risk losing their privacy by providing personal information in exchange for those services. In this talk, we will first describe typical scenarios in which privacy breaches occur and then give a brief overview of existing approaches to handling privacy issues. Finally, we focus on privacy preservation for data sharing at BlueKai Inc. This data setting distinguishes itself from previous ones by its much larger number of records and much higher-dimensional attribute vectors. These characteristics pose great challenges to existing approaches and motivate the idea of reducing the certainty of individuals' profiles via noise injection. The feasibility and effectiveness of the proposed method are demonstrated by applying it to simulated campaigns for Expedia.

Biography: Qi Zhao is currently a Ph.D. student at the IRKM Lab, UCSC. His research interests lie in applying statistical knowledge and machine learning techniques to problems involving large-scale data. Specifically, Qi is now developing algorithms for Internet privacy preservation. Prior to coming to UCSC, he obtained his M.S. and B.S. degrees at Fudan University, China.

KM Session III: Search and Recommendation (Chair: Yi Zhang)

“Utilizing Marginal Net Utility for Recommendation in E-commerce” Jian Wang (PhD Student, UCSC)

Traditional recommendation algorithms often select products with the highest predicted ratings to recommend. However, earlier research in economics and marketing indicates that a consumer usually makes purchase decisions based on the product's marginal net utility (i.e., the marginal utility minus the product price). Utility is defined as the satisfaction or pleasure a user gets when purchasing the corresponding product. A rational consumer chooses the product to purchase in order to maximize the total net utility. In contrast to the predicted rating, the marginal utility of a product depends on the user's purchase history and changes over time.

To better match users' purchase decisions in the real world, we explore how to recommend products with the highest marginal net utility in e-commerce sites. Inspired by the Cobb-Douglas utility function in consumer behavior theory, we propose a novel utility-based recommendation framework. The framework can be utilized to revamp a family of existing recommendation algorithms. To demonstrate the idea, we use Singular Value Decomposition (SVD) as an example and revamp it with the framework. We evaluate the proposed algorithm on an e-commerce (shop.com) data set. The new algorithm significantly improves the base algorithm, largely due to its ability to recommend both products that are new to the user and products that the user is likely to re-purchase.
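For reference, one common textbook statement of the Cobb-Douglas utility the framework builds on (the symbols below are generic notation, not necessarily the paper's):

```latex
% Cobb-Douglas utility over quantities x_1,...,x_n with preference weights a_i > 0:
U(x_1,\dots,x_n) = \prod_{i=1}^{n} x_i^{a_i}
% Marginal utility of product i, which falls as the user already owns more of it
% (for a_i < 1), capturing the dependence on purchase history:
\frac{\partial U}{\partial x_i} = \frac{a_i\, U}{x_i}
% Marginal net utility subtracts the price; ranking by this quantity rather than
% by predicted rating is the core of the proposed framework:
\mathrm{MNU}_i = \frac{\partial U}{\partial x_i} - p_i
```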

Biography: Jian Wang is a third-year Ph.D. student at the University of California, Santa Cruz, where she works with Prof. Yi Zhang in the Information Retrieval and Knowledge Management Lab. Her research interests include recommender systems, information retrieval, and data mining. She has published at ACM SIGIR, ACM RecSys, and ACM CIKM, among other venues. Jian received a master's degree from Lehigh University in Pennsylvania in 2009 and a bachelor's degree from Fudan University in 2007. She has worked in the eBay research lab, helping build the post-purchase recommendation engine, and earlier on the IBM WebSphere team.

 “Recommendation System for the Facebook Open Graph” Wei Xu (Researcher, Facebook)

The Open Graph at Facebook contains very rich connections between hundreds of millions of users and billions of objects. Recommendation technology is important for finding the most interesting objects for the users from the huge amount of objects in the Graph. In this talk, I will give the following:  1) an overview of the different recommendation tasks we are facing at Facebook; 2) the tools we provide to the developers for accessing object recommendations from the Graph; and 3) the challenges and solutions for building such a recommendation system.

“Recommender Systems at the Long Tail” Neel Sundaresan (Senior Director, eBay)

Online recommender systems are essential to eCommerce. A complex marketplace like eBay poses unique challenges and opportunities. The large diversity in the item, buyer, and seller spaces introduces super-sparsity at scale. However, the elaborate transaction flow offers opportunities for a wide class of recommender applications. In this talk we will discuss these challenges, opportunities, and systems for recommendations.

“Filtering Semi-Structured Documents Based on Faceted Feedback” Lanbo Zhang (PhD Student, UCSC)

Existing adaptive filtering systems learn user profiles based on users' relevance judgments on documents. In some cases, users have some prior knowledge about what features are important for a document to be relevant. For example, a Spanish speaker may only want news written in Spanish, and thus a relevant document should contain the feature "Language: Spanish"; a researcher working on HIV knows an article with the medical subject "MeSH: AIDS" is very likely to be interesting to him/her.

Semi-structured documents with rich faceted metadata are increasingly prevalent over the Internet. Motivated by the commonly used faceted search interface in e-commerce, we study whether users' prior knowledge about faceted features could be exploited for filtering semi-structured documents. We envision two faceted feedback solicitation mechanisms, and propose a novel user profile-learning algorithm that can incorporate user feedback on features. To evaluate the proposed work, we use two data sets from the TREC filtering track, and conduct a user study on Amazon Mechanical Turk. Our experimental results show that user feedback on faceted features is useful for filtering. The new user profile learning algorithm can effectively learn from user feedback on faceted features and performs better than several other methods adapted from the feature-based feedback techniques proposed for retrieval and text classification tasks in previous work. (Slides)


(Organizers: Alex Pang, Bruno Sanso)

 WAVE Session I: Visual Exploration of Cosmology Data (Chair: Alex Pang)

 “Halo Finder vs Local Extractors: Similarities and Differences” Uliana Popov (PhD Student, UCSC)

Multi-streaming events are of great interest to astrophysics because they are associated with the formation of large-scale structures (LSS) such as halos, filaments, and sheets. Until recently, these events were studied using the scalar density field only. In this talk, we present a new approach that takes the velocity field information into account in finding these multi-streaming events. Six different velocity-based feature extractors are defined, and their findings are compared to the results of a halo finder. (Poster) (Slides)

“Families Interacting with Scientific Visualizations in an Immersive Planetarium Show” Zoe Buck (PhD Student, UCSC)

This talk will focus on findings from the first round of data collection for a research project, in collaboration with the Adler Planetarium, about how families interact with a new planetarium show. The show was designed very differently from a traditional planetarium show, favoring an immersive, visually stunning experience over a more content-rich didactic experience centering on a live speaker. Data include written surveys, short interviews, and extended interviews (including stimulated recall of specific show visualizations) with planetarium visitors, as well as in-depth interviews with show designers. Preliminary results indicate that patrons are seeking a show that allows them to "experience space" (much as one would visit an aquarium to experience fish firsthand) rather than a didactic experience. To this end, it appears that scientific visualizations can provide a positive and productive experience even when the accompanying narrative has very little content or explanation. In addition, specific artistic decisions such as color choice can influence the ways in which planetarium visitors interpret visualizations, and should therefore be considered carefully if a visualization is to be made available for educational purposes. (slides)

“Visualizing High-Resolution Simulations of Galaxy Formation and Comparing the Simulations to the Latest Observations from Hubble and Other Telescopes” Joel Primack (Professor, UCSC)

My research group at UCSC is running very high resolution hydrodynamic simulations of galaxy formation and evolution for comparison with ~250,000 images of forming galaxies from the CANDELS survey (http://candels.ucolick.org/About.html), the largest project in the history of Hubble Space Telescope, which is led by UCSC Astronomer Sandra Faber. We are using visualizations of the evolution of gas and stars to help us understand the key processes in the simulations. We are also using our state-of-the-art Sunrise radiative transfer code to make realistic images from the simulations in many wavebands, including the effects of stellar evolution and dust scattering, absorption, and re-emission. We are using automatic neural net machine learning to classify these images and those from CANDELS (using thousands of CANDELS images that have been classified by astronomers as a training set).  We need help applying computer vision tools to extract and analyze the most relevant features in the observational and simulated images, in order to determine the extent to which the simulations agree with the observations, so that we can use the simulations to interpret the observations reliably. (Slides)

WAVE Session II: Analysis of Cosmology Data (Chair: Bruno Sanso)

 “Gaussian Process Modeling of Dark Energy” Tracy Holsclaw (Alumna, UCSC) 

Gaussian process (GP) models provide non-parametric methods to fit continuous curves observed with noise. Motivated by our investigation of dark energy, we develop a GP-based inverse method that allows for the direct estimation of the derivative of a curve. In principle, a GP model may be fit to the data directly, with the derivatives obtained by means of differentiation of the correlation function. However, it is known that this approach can be inadequate due to loss of information when differentiating. We present a new method of obtaining the derivative process by viewing this procedure as an inverse problem. We use the properties of a GP to obtain a computationally efficient fit. We illustrate our method with simulated data as well as apply it to our cosmological application. (Slides)
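The differentiation step mentioned above has a compact form. For a zero-mean GP, derivatives are again Gaussian, with covariances obtained by differentiating the correlation function (a standard GP property; the specific kernel used in this work is not stated in the abstract):

```latex
f \sim \mathcal{GP}\big(0,\, k(x, x')\big)
\;\Longrightarrow\;
\operatorname{Cov}\!\big(f'(x),\, f(x')\big) = \frac{\partial k(x, x')}{\partial x},
\qquad
\operatorname{Cov}\!\big(f'(x),\, f'(x')\big) = \frac{\partial^2 k(x, x')}{\partial x\, \partial x'}.
```

The information loss referred to above arises because the differentiated covariances are less smooth than $k$ itself, which motivates treating derivative estimation as an inverse problem instead.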

“Science with Cosmic Visualization” Katrin Heitmann (Researcher, Argonne National Laboratory)

The formation of the large-scale structure in the Universe is a highly complex, nonlinear process. Understanding the physics of structure formation requires large simulations tracking the evolution of the Universe over time. In order to gain deeper insights into the simulation results, visualization is an invaluable tool. In this talk I will show some results where visualization helped us sharpen our understanding of large-scale structure formation, understand differences in the results from different computational approaches to solving the N-body problem, and track the formation of structures over time. We have implemented important analysis tools for cosmological simulations in ParaView, which I will introduce.

“How to Stuff a Supercomputer into a Laptop and Help Invert the Universe” Salman Habib (Researcher, Argonne National Laboratory)

In more than a century and a half of effort, observations of the deep sky have made remarkable contributions to our knowledge of cosmology. Today, there exists a well-measured cosmological model that fits all known data to accuracies better than 10 percent. Astonishingly, it is possible that within the next decade or so, observational errors will be reduced to the percent level. Because cosmology is an observational, not experimental, science, precision scientific inference is a matter of solving a statistical inverse problem using Markov chain Monte Carlo (MCMC) techniques. But because the forward model evaluation is so expensive -- large supercomputer codes must be run in each case to obtain predictions at better than percent accuracy -- it is hopeless to proceed via brute force. In this talk I will describe 'cosmic calibration', a statistical framework we have recently developed that uses sophisticated sampling design and interpolating strategies in high-dimensional spaces, combined with results from a finite set of simulations run over a range of cosmological parameters. The framework produces an accurate 'oracle' that can be run essentially instantaneously on a laptop; the oracle, or 'emulator', yields accurate results for observables at parameter values that lie anywhere within the range of the parameter sampling design. Cosmic calibration enables MCMCs to be run on laptops in tens of minutes instead of many years on supercomputers. The basic ideas can be applied to many other fields.
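The emulator idea can be sketched in one dimension. This is only an illustration of the concept: the names `expensive_model` and `emulator` are invented here, and the actual cosmic-calibration framework uses sophisticated sampling designs and interpolation (not a Lagrange polynomial) in high-dimensional parameter spaces.

```python
# Minimal 1-D sketch of the emulator concept: run the expensive model at
# a few design points, then interpolate cheaply everywhere in between.
import math

def expensive_model(theta):
    """Stand-in for a supercomputer run producing one scalar observable."""
    return math.exp(-theta) * math.sin(3.0 * theta)

# 1. Evaluate the expensive model at a small design of parameter values.
design = [0.0, 0.25, 0.5, 0.75, 1.0]
outputs = [expensive_model(t) for t in design]

# 2. The "emulator": a cheap interpolator (Lagrange polynomial here) that
#    can be evaluated essentially instantly inside the design range.
def emulator(theta):
    total = 0.0
    for i, ti in enumerate(design):
        weight = 1.0
        for j, tj in enumerate(design):
            if i != j:
                weight *= (theta - tj) / (ti - tj)
        total += outputs[i] * weight
    return total
```

Inside an MCMC loop one would then call `emulator(theta)` at each proposed parameter value instead of launching a new simulation.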


(Organizers: James Davis, Neoklis Polyzotis)

 HC Session I: Applications of Human Computation (Chair: Neoklis Polyzotis)

 “Computer Vision using Human Computation” James Davis (Professor, UCSC) 

Computer-mediated human micro-labor markets allow human effort to be treated as a programmatic function call. We can characterize these platforms as Human Processing Units (HPU). HPU computation can be more accurate than complex CPU based algorithms on some important computer vision tasks. We also argue that HPU computation can be cheaper than state-of-the-art CPU based computation. I'll give some examples of simple computer vision tasks that we have evaluated, and speculate on whether a finite computer vision instruction set is possible. The instruction set would allow most computer vision problems to be coded from the base instructions, and the instructions themselves would be made robust with the help of human computation.

“Recommendation-based De-Identification (Re-Did)” Varun Bhagwan (PhD Student, UCSC)

De-identification is the process of removing and/or transforming informational artifacts in data so that they cannot be used to identify, contact, or locate an individual. A significant amount of legislative, technical, and social effort has been and is being devoted to ensuring that no personally identifiable information (PII) is released when healthcare data is shared, while maintaining the usability and usefulness of the data for analyzing medical records for point-of-care outcomes, treatment evaluation, comparative effectiveness, retrospective studies, and various other secondary uses. Traditionally, this de-identification is performed either by software or by humans. Computer algorithms that discover and de-identify PII have well-known shortcomings with respect to re-identification risk and usability reduction; privacy and anonymity via algorithmic techniques cannot be guaranteed, only optimized within a specific context. Human-centric efforts that manually identify and transform sensitive content have proven inefficient and infeasible, especially for large datasets, and produce result sets with a high proportion of errors. Our primary goal is to keep sensitive (patient) data private while sharing (medical) data pertinent for secondary analysis.

We propose the Re-Did (Recommendation-based De-Identification) framework, which couples the algorithmic and human-centric approaches to de-identification, leveraging their individual advantages to improve the quality of the results. While humans tend to be more precise with the annotations they identify, automated de-identification algorithms are more scalable and can identify a larger number of candidate annotations. The Re-Did system exploits this complementary nature of human adjudication and automated de-identification. We accomplish this by introducing a recommendation layer, which analyzes and augments the output of an algorithmic de-identification program to generate PII and non-PII candidates; these candidates are then presented to human workers for adjudication. The benefit of this approach is that it combines the good recall and scalability of automated de-identification with the good precision and ambiguity resolution of human computation. To evaluate the effectiveness of PII removal, we also develop an evaluation framework and methodology.

“Using Human Computation to Create a Natural Language Query Interface for Relational Databases” Bogdan Alexe (PhD Student, UCSC)

In many practical scenarios, non-expert users need to interact with relational databases without being familiar with SQL, the prevalent query language for such databases. In this work we investigate a novel approach to solving the problem of translating natural language questions into SQL queries. We propose the use of human computation as the backbone of our solution: leveraging large communities of workers remunerated by micropayments for drafting and refining translations. In addition to describing some initial ideas towards this end, we introduce a measure that quantifies the quality of produced translations, and present an experimental evaluation of our work using this measure.

“That's Your Evidence?: Using Mechanical Turk To Develop A Computational Account Of Debate And Argumentation In Online Forums” Marilyn Walker (Professor, UCSC)

A growing body of work has highlighted the challenges of identifying the stance that a speaker holds towards a particular topic, a task that involves identifying a holistic subjective disposition. We examine stance classification on a corpus of 4731 posts from the debate website ConvinceMe.net, for 14 topics ranging from the playful to the ideological. We show that ideological debates feature a greater share of rebuttal posts, and that rebuttal posts are significantly harder to classify for stance, for both humans and trained classifiers. We also demonstrate that the number of subjective expressions varies across debates, a fact correlated with the performance of systems sensitive to sentiment-bearing terms. We present results for classifying stance on a per-topic basis that range from 60% to 75%, as compared to unigram baselines that vary between 47% and 66%. Our results suggest that features and methods that take into account the dialogic context of such posts improve accuracy. (slides) (paper)
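The unigram baseline mentioned above can be sketched as a smoothed multinomial word-count classifier over the two stances. This is an illustrative reconstruction, not the paper's implementation; the training posts below are invented toy data, not drawn from the ConvinceMe.net corpus.

```python
# A minimal unigram stance classifier: per-stance word counts with
# add-one smoothing, scored in log space.
import math
from collections import Counter

def train_unigram(posts):
    """posts: (text, stance) pairs; returns per-stance unigram counts."""
    counts = {"pro": Counter(), "con": Counter()}
    for text, stance in posts:
        counts[stance].update(text.lower().split())
    return counts

def classify(model, text, alpha=1.0):
    """Score each stance with an add-alpha-smoothed multinomial model."""
    vocab = set(model["pro"]) | set(model["con"])
    best, best_score = None, float("-inf")
    for stance, counts in model.items():
        total = sum(counts.values()) + alpha * len(vocab)
        score = sum(math.log((counts[w] + alpha) / total)
                    for w in text.lower().split())
        if score > best_score:
            best, best_score = stance, score
    return best

# Toy training data (invented for illustration):
posts = [("i support the ban", "pro"), ("the ban saves lives", "pro"),
         ("i oppose the ban", "con"), ("the ban hurts freedom", "con")]
model = train_unigram(posts)
```

A classifier like this sees only word identity, which is why rebuttal posts, whose meaning depends on the post they answer, are hard for it.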

HC Session II: Infrastructure for Crowdsourced Applications (Chair: James Davis)

“Online Reputation Systems” Luca de Alfaro (Professor, UCSC)

Reputation systems constitute the on-line equivalent of the body of social norms, laws, and regulations that keep people interacting productively in the off-line world. As collaboration, content creation, and interactions move to the on-line world, reputation systems are increasingly used to provide incentives for constructive behavior, discourage vandalism, and help collaboration and the emergence of high-quality content.
We describe reputation systems we have developed for Wikipedia and Google Maps, examining their structure, their effectiveness in facilitating collaboration and fighting vandalism, and their susceptibility to attacks. From this analysis, we derive some basic design principles that can be applied to a wide variety of on-line collaborative settings.

“CrowdSight: Rapidly Prototyping Visual Processing Apps” Mario Rodriguez (PhD Student, UCSC)

We describe a framework for rapidly prototyping applications which require intelligent visual processing, but for which reliable algorithms do not yet exist, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system, freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework. (Slides)

“Top-1 Algorithms over MTurk” Kerui Huang (PhD Student, UCSC)

Crowdsourcing is growing rapidly. Beyond its obvious advantages, such as convenience, speed, flexibility, and economy, we are interested in its quality as well. This talk describes our experiments on using Mechanical Turk to identify the top-1 item in a set. We evaluate two intuitive algorithms (termed Tournament and Bubble) under different settings for the pricing of tasks, the UI presented to human workers, the control for the quality of returned answers, and the total execution time of each algorithm. Our results show that the Tournament algorithm is more accurate and robust in practice. (Slides)
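The two algorithms can be sketched with a pairwise comparison treated as a function call, in the spirit of the HPU abstraction earlier in the program. The code below is a hedged reconstruction from the algorithm names only; `hpu_compare` is a stand-in for a real crowdsourced comparison task, with an optional error rate to mimic noisy workers.

```python
# Two top-1 strategies over noisy pairwise comparisons.
import random

def hpu_compare(a, b, error_rate=0.0, rng=random):
    """Stand-in for a crowdsourced comparison: returns the larger item,
    flipping the answer with probability error_rate."""
    winner, loser = (a, b) if a > b else (b, a)
    return loser if rng.random() < error_rate else winner

def tournament_top1(items, compare):
    """Single-elimination: pair items each round and keep the winners."""
    pool = list(items)
    while len(pool) > 1:
        nxt = [compare(pool[i], pool[i + 1])
               for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:          # odd item out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]

def bubble_top1(items, compare):
    """Carry a running winner through the list, one comparison per item."""
    best = items[0]
    for item in items[1:]:
        best = compare(best, item)
    return best
```

Both use n-1 comparisons, but a single wrong answer eliminates the true top item for the rest of a Bubble pass, while a Tournament loss only matters in the one round it occurs, which is consistent with the robustness result reported above.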

“The Stanford-UC Santa Cruz Project for Cooperative Computing with Algorithms, Data and People” Neoklis Polyzotis (Professor, UCSC)

This talk will provide an overview of the SCOOP project, whose broad theme is to leverage people as processing units to achieve some global objective. A primary focus of SCOOP is to optimize the usage of human computation in order to use as few resources (e.g., time, money) as possible while maximizing the quality of the final output. Our approach is based on the principle of declarative languages that has been applied very successfully in database systems. The talk will describe the main research thrusts in SCOOP and some of our recent accomplishments.


The session will feature posters for several of the above presentations and the following projects.

“A Parametric Model of Electromagnetic Environments” Janelle Yong (PhD Student, UCSC), John Galbraith, Eric Raby, and Don Wiberg

Remote sensing is a technique used to gain information and collect data about objects that are not in contact with the sensors. We can obtain coupled RF energy from arbitrary circuits using remote sensing. Maxwell’s equations describe the physics of the response of the target circuit to a sensing antenna. However, Maxwell’s equations are too complicated to solve in this situation because of the complex real-world electromagnetic environment the circuit is in. Harmonic Inversion is a useful algorithm that allows us to create a parametric model to understand the RF coupling in electromagnetic environments. We are able to achieve higher frequency resolution from a smaller number of samples using Harmonic Inversion than with the Discrete Fourier Transform. (poster)
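The resolution advantage of a parametric fit over the DFT can be seen in the simplest possible case: one undamped sinusoid. The sketch below is a Prony-style toy, not the Harmonic Inversion algorithm itself (which handles many damped sinusoids); it only illustrates why fitting a signal model beats binning energy into DFT frequencies when samples are scarce.

```python
# Parametric frequency estimation for a single undamped sinusoid.
import math

def estimate_frequency(x):
    """x[n] = A*cos(w*n + phi) satisfies x[n] = 2*cos(w)*x[n-1] - x[n-2],
    so 2*cos(w) can be recovered by least squares over the samples."""
    num = sum(x[n - 1] * (x[n] + x[n - 2]) for n in range(2, len(x)))
    den = sum(x[n - 1] ** 2 for n in range(2, len(x)))
    return math.acos(num / (2.0 * den))

# 16 samples of cos(0.3*n): the DFT bin spacing is 2*pi/16 ~ 0.39 rad,
# yet the parametric fit recovers w = 0.3 far below that resolution.
samples = [math.cos(0.3 * n) for n in range(16)]
w = estimate_frequency(samples)
```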

“Learning to Active Learn with Applications in the Online Advertising Field of Look-Alike Modeling” James G. Shanahan (Professor, UCSC) and Nedim Lipka (Researcher, Bauhaus-Univ. Weimar)

Digital online advertising is a form of promotion that uses the Internet and World Wide Web for the express purpose of delivering marketing messages to attract customers. Online advertising can be targeted based on a user's online behavior. The goal of look-alike data modeling is to build a computational model that can readily identify individuals online who resemble individuals who have already transacted (e.g., purchased a product or service) as a result of seeing a particular advertisement. This problem can be modeled as a binary supervised classification problem: individuals who were exposed to a particular advertisement and subsequently transacted can be treated as positive examples, and individuals who were exposed to the same ad and have not transacted can be treated as negative examples. Obtaining positive examples (individuals who transacted) in this scenario can often prove expensive for advertisers. Active learning, a sub-discipline of supervised learning, addresses this concern head-on by focusing on training models that achieve greater accuracy with fewer training labels. This is largely accomplished by allowing the active learning framework to choose the data from which it learns. This has traditionally been accomplished using heuristics such as selecting examples with high degrees of uncertainty, or using cluster-hypothesis practices where examples are selected heuristically from clusters of data. In this paper we propose a more general selection framework based upon machine learning, where the goal is to learn a selection model from example data (i.e., learn to active learn). The proposed approach will be demonstrated within an online advertising framework where positive examples will correspond to selections (showing a particular ad to a particular website visitor) that resulted in transactions. Reduced exposures of ads will lead to more effective advertising and reduced costs for advertisers.
At the time of writing, this is work in progress; we hope to report on our findings at the workshop.
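The traditional uncertainty heuristic that the abstract contrasts with its learned selector can be sketched concisely. Everything below is illustrative: the `engagement` feature, the logistic scoring model, and the visitor records are all invented for the sketch, not part of the authors' system.

```python
# Pool-based uncertainty sampling: label the examples the model is
# least sure about (predicted probability closest to 0.5).
import math

def uncertainty_sample(pool, predict_proba, k=1):
    """Return the k unlabeled examples with predictions nearest 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Hypothetical scoring model: probability of transacting rises with a
# single made-up "engagement" feature via a logistic link.
def predict_proba(visitor):
    return 1.0 / (1.0 + math.exp(-visitor["engagement"]))

visitors = [{"id": i, "engagement": e}
            for i, e in enumerate([-3.0, -0.2, 0.1, 2.5, 4.0])]
queried = uncertainty_sample(visitors, predict_proba, k=2)
```

The proposed "learn to active learn" framework would replace the fixed `abs(p - 0.5)` ranking with a selection model trained on examples of which queries actually helped.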