Annual PDL Consortium Speaker Series

May 7, 2019, 12:00pm - 5:00pm
Location: Panther Hollow Conference Room 4101 - Robert Mehrabian Collaborative Innovation Center
Speaker: A One-afternoon Series of Special Talks by PDL Consortium Visitors
http://www.pdl.cmu.edu/SDI/2019/050719.html

SCHEDULE: (Please note - speaker order may change, see website)
12:00 - 12:45 pm: Ehsan Totoni, Intel
12:45 - 1:30 pm: Luis Remis, Intel
1:30 - 1:50 pm: Break
1:50 - 2:35 pm: Sathya Gunasekar, Facebook
2:35 - 3:20 pm: Jim Cipar, Facebook
3:20 - 3:40 pm: Break
3:40 - 4:25 pm: Aurosish Mishra, Oracle
4:25 - 5:10 pm: Pat Helland, Salesforce

Ehsan Totoni, Intel
HPAT: compiling Python analytics codes to optimized HPC binary automatically

Data science and AI approaches promise intelligent applications using insights from large datasets. However, data scientists require programming frameworks that allow easy experimentation on big data and productive development of complex data workloads. In addition, deployment of data applications requires high scalability and efficiency. Existing big data frameworks (e.g. Apache Spark) fall well short of these needs because of their high-overhead, library-based master/executor approach.

High Performance Analytics Toolkit (HPAT) is a compiler-based big data framework that automatically compiles Python analytics codes to optimized binaries with MPI, providing Python productivity and HPC efficiency simultaneously. HPAT uses several domain-specific compiler techniques to achieve this goal, including a new auto-parallelization compiler algorithm that detects the underlying map/reduce parallel pattern. Furthermore, HPAT performs high-level optimizations (e.g. fusion of operators) by treating analytics APIs (Pandas/Numpy) as a domain-specific language (DSL), but without imposing DSL limitations on the programmer. Performance evaluation of HPAT shows up to 2000x speedup over Spark for common benchmarks.
In addition, several real applications demonstrate the benefits of the HPAT approach, including deployment from cloud to edge without any code rewrite. HPAT is under development as a software product to enable the next generation of data-centric applications.

BIO: Ehsan Totoni is a Research Scientist at Intel Labs, working on programming systems for large-scale big data analytics that provide high programmer productivity as well as high performance on modern hardware. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2014. During his Ph.D. studies, he was a member of the Charm++/AMPI team working on performance and energy efficiency of HPC systems.

Luis Remis, Intel
VDMS: Your Favorite Visual Data Management System

We introduce the Visual Data Management System (VDMS), which enables faster access to big visual data and adds support for visual analytics. This is achieved by searching for relevant visual data via metadata stored as a graph, and by enabling faster access to visual data through new machine-friendly storage formats. VDMS differs from existing large-scale photo serving, video streaming, and textual big-data management systems in its primary focus on supporting machine learning and data analytics pipelines that use visual data (images, videos, and feature vectors), treating these as first-class entities. We will describe how to use VDMS via its user-friendly interface, and how the system integrates with the other components for end-to-end (E2E) visual processing that are ongoing work at Intel Labs.

BIO: Luis Remis is a member of the Systems and Software Research Group at Intel Labs, where his current research involves Cloud Systems for Visual Data. He has been working on the Visual Data Management System project since its inception. He holds an M.S.
in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), where he was a Research Assistant working on graph processing using heterogeneous platforms. His industry experience includes being part of the Modeling team at the Aerospace Division at INVAP from 2012 to 2014, where he worked on R&D for radar signal processing using graphics accelerators, and being part of the Autopilot team at Tesla Motors in 2015.

Sathya Gunasekar, Facebook
CacheLib - Unifying & Abstracting HW for caching @ FB

In order to operate with high efficiency, Facebook's infrastructure relies on caching in many different backend services. These services place very different demands on their caches, e.g., in terms of working set sizes, access patterns, and throughput requirements. Historically, each service used a different cache implementation, leading to inefficiency and duplicated code and effort.

CacheLib is an embedded caching engine that addresses this problem with a unified API for building a cache implementation across many hardware media. CacheLib transparently combines volatile and non-volatile storage in a single caching abstraction. To meet the varied demands, CacheLib provides a flexible, high-performance solution for many different services at Facebook. In this talk, we describe CacheLib's design, challenges, and several lessons learned.

BIO: I have been a software engineer on the Cache Infrastructure team at Facebook since 2012. Cache Infrastructure develops and operates services to provide efficient, online access to social graph data. It encompasses services like TAO and Memcache, and libraries like CacheLib that enable building cache services at Facebook. I graduated with a master's in Computer Science from the University of Wisconsin-Madison.

Jim Cipar, Facebook
Intelligent Caching: Using Machine Learning to Reduce Flash Wear by 60%

Large caches stored on flash memory present different design challenges than traditional RAM-based caches.
For instance, the admission policy (choosing whether to add an object to the cache or drop it) is more important than the eviction policy (choosing which object, of many, to remove from the cache when it is full). In this talk we give an overview of the types of caches used by Facebook's content delivery network, and explain why cache admission is an important problem. We describe an admission policy based on machine learning, and show how it can significantly improve important metrics such as flash write rate and cache latency.

BIO: Jim Cipar is a software engineer on Facebook's MLX team, focusing on applying machine learning to networking and infrastructure challenges. In addition to caching, the MLX team works on live database queries, service stress testing, load balancing, and build/test automation. He earned his PhD from Carnegie Mellon, and is a PDL alum.

Aurosish Mishra, Oracle
Oracle Autonomous Database - Sit back, relax and let Oracle do the driving!

Oracle Autonomous Database is the industry's first self-driving, self-securing and self-repairing cloud database. It combines decades of database automation techniques and database infrastructure development with the power of machine learning to deliver a fully autonomous database. This revolutionizes data management, enabling enterprises to evolve from builders and managers of databases into users of autonomous database cloud services that offer self-driving capabilities for any workload. In this talk, we will peek under the hood of the Oracle Autonomous Database and see how Oracle achieved the autonomous vision, making the world's best database also the world's simplest.

BIO: Aurosish Mishra is a Software Development Manager in the Oracle Database Engine group. His team is responsible for building the storage engine for the next-gen, cloud-scale Oracle Autonomous Database, leveraging innovative technologies such as NVM storage, RDMA access and SIMD processing.
He also leads the development of key features for Oracle's flagship Database In-Memory Engine that provides real-time analytics at the speed of thought. Aurosish holds a Master's degree in Computer Science from Cornell University, and a Master's/Bachelor's degree in Computer Science from IIT Kharagpur.

Pat Helland, Salesforce
Standing on the Distributed Shoulders of Giants

If you squint hard enough, many of the challenges of distributed computing appear similar to the work done by the great physicists. Dang, those fellows were smart! Here, I examine some of the most important physics breakthroughs and draw some whimsical parallels to phenomena in the world of computing... just for fun.

BIO: Pat Helland has been implementing transaction systems, databases, application platforms, distributed systems, fault-tolerant systems, and messaging systems since 1978. For recreation, he occasionally writes technical papers. He currently works at Salesforce.

Event Website: http://www.pdl.cmu.edu/SDI/2019/050719.html
For More Information: karen@ece.cmu.edu