An Algorithmic and Information-Theoretic Toolbox for Massive Data

ECE6980
An Algorithmic and Information-Theoretic Toolbox for Massive Data

Logistics

Instructor: Jayadev Acharya, 304 Rhodes Hall
Lectures: TuTh 1325-1440 hours, 203 Phillips Hall
Office hours: MoTh 1500-1600 hours, 304 Rhodes Hall
Course description

Overview

Massive systems are being built to model, analyze, and make inference from data. These systems will be increasingly complex. We will consider some of the core primitives, which are integral to building such systems. The lectures will consist of three units:

Unit 1: Learning. Discrete distributions, shape restrictions, mixture models, Scheffe estimator, competitive learning, Good-Turing probability estimation.

Unit 2: Testing. Closeness of distributions, classification, testing identity, competitive testing, testing properties.

Unit 3: Estimation. Estimating Shannon and Rényi entropy of distributions, moments, support size, unseen elements.

In each unit, we will discuss the design of efficient algorithms, prove the information theoretic limits, and brainstorm on some future directions. We will also try to understand these problems under constraints on resources such as memory and communication.
The first few lectures will include preliminaries such as concentration inequalities, and basic information theory.

Grading

Assignments 30-60%
Final project report and presentation 40-60%
Scribing 10%
Class participation 5%

Code of academic integrity

Materials

There are no textbooks that we will follow for the course. We will provide pointers to various research articles and books as we navigate the course.

Assignments

Sep 5: Assignment 1 uploaded [pdf] Due: September 27

Lectures

Please use the following template for scribing the lectures: LaTeX Files.
Please send any errors to acharya AT cornell dot edu.

Aug 23: Introduction slides [pdf]
Aug 25: Statistical distances and concentration [pdf]
Aug 30: Minimax setting, learning discrete distributions, lower-bound for learning Bernoulli distributions [pdf]
Sep 01: Lower bound for general discrete distribution learning, basic information theory [pdf]
Elements of Information Theory. T. Cover, J. Thomas
Assouad, Fano, and Le Cam. B. Yu
Sep 06: Information theory basics, metric entropy [pdf]
Elements of Information Theory. T. Cover, J. Thomas. Chapter 2.
Combinatorial Methods in Density Estimation. L. Devroye, G. Lugosi
Sep 08: Metric entropy, Gaussian mixtures [pdf]
Combinatorial Methods in Density Estimation. L. Devroye, G. Lugosi
Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. C. Daskalakis, G. Kamath.
Near-Optimal-Sample Estimators For Spherical Gaussian Mixtures. J. Acharya, A. Jafarpour, A. Orlitsky, and A. Suresh
Sep 13: Learning monotone and other shape restricted distributions
Sep 15: Robust Learning of distributions [pdf]
Robust Statistics. P. Huber, E. Ronchetti
Robust Estimators in High Dimensions without the Computational Intractability. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart.
Sep 20: Missing mass and Good-Turing probability estimation [pdf]

On the convergence of Good Turing Estimators. D. McAllester, R. Schapire
Concentration Bounds for Unigram Language Models. E. Drukh, Y. Mansour
Sep 22: Universal Compression and Competitive Distribution Estimation

Unit 2

References:

A Survey on Distribution Testing, Clement Canonne [pdf]
Testing random variables for independence and identity. T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White.
A coincidence-based test for uniformity given very sparsely-sampled discrete data. L. Paninski
Testing Shape Restrictions of Discrete Distributions. C. Canonne, I. Diakonikolas, T. Gouleakis, R. Rubinfeld
Optimal Testing for Properties of Distributions. J. Acharya C. Daskalakis, G. Kamath
A New Approach for Testing Properties of Discrete Distributions, I. Diakonikolas, D. Kane

Sep 27, 29: Distribution Property Testing, Uniformity [pdf]

ECE6980 An Algorithmic and Information-Theoretic Toolbox for Massive Data