An Algorithmic and Information-Theoretic Toolbox for Massive Data

ECE6980
An Algorithmic and Information-Theoretic Toolbox for Massive Data

Logistics

Instructor: Jayadev Acharya, 304 Rhodes Hall
Lectures: TuTh 1325-1440 hours, 203 Phillips Hall
Office hours: MoTh 1500-1600 hours, 304 Rhodes Hall
Course description

Overview

Massive systems are being built to model, analyze, and make inference from data. These systems will be increasingly complex. We will consider some of the core primitives, which are integral to building such systems. The lectures will consist of three units:
  • Unit 1: Learning. Discrete distributions, shape restrictions, mixture models, Scheffe estimator, competitive learning, Good-Turing probability estimation.
  • Unit 2: Testing. Closeness of distributions, classification, testing identity, competitive testing, testing properties.
  • Unit 3: Estimation. Estimating Shannon and Rényi entropy of distributions, moments, support size, unseen elements.
In each unit, we will discuss the design of efficient algorithms, prove the information theoretic limits, and brainstorm on some future directions. We will also try to understand these problems under constraints on resources such as memory and communication.
The first few lectures will include preliminaries such as concentration inequalities, and basic information theory.

Grading

  • Assignments 30-60%
  • Final project report and presentation 40-60%
  • Scribing 10%
  • Class participation 5%
  • Follow the Code of academic integrity. Collaboration is allowed in the assignments, copying is not. Acknowledge the discussions and other sources used. Assignments should be submitted before the deadline.
    The project will consist of reading a recent research paper. Students will be expected to submit a write-up summarizing the paper, and give a short presentation.

Materials

There are no textbooks that we will follow for the course. We will provide pointers to various research articles and books as we navigate the course.

Assignments

  • Sep 5: Assignment 1 uploaded [pdf] Due: September 27

Lectures

Please use the following template for scribing the lectures: LaTeX Files.
Please send any errors to acharya AT cornell dot edu.
  • Aug 23: Introduction slides [pdf]
  • Aug 25: Statistical distances and concentration [pdf]
  • Aug 30: Minimax setting, learning discrete distributions, lower-bound for learning Bernoulli distributions [pdf]
  • Sep 01: Lower bound for general discrete distribution learning, basic information theory [pdf]

    Elements of Information Theory. T. Cover, J. Thomas
    Assouad, Fano, and Le Cam. B. Yu

  • Sep 06: Information theory basics, metric entropy [pdf]

    Elements of Information Theory. T. Cover, J. Thomas. Chapter 2.
    Combinatorial Methods in Density Estimation. L. Devroye, G. Lugosi

  • Sep 08: Metric entropy, Gaussian mixtures [pdf]

    Combinatorial Methods in Density Estimation. L. Devroye, G. Lugosi
    Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. C. Daskalakis, G. Kamath.
    Near-Optimal-Sample Estimators For Spherical Gaussian Mixtures. J. Acharya, A. Jafarpour, A. Orlitsky, and A. Suresh

  • Sep 13: Learning monotone and other shape restricted distributions
  • Sep 15: Robust Learning of distributions [pdf]

    Robust Statistics. P. Huber, E. Ronchetti
    Robust Estimators in High Dimensions without the Computational Intractability. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart.

  • Sep 20: Missing mass and Good-Turing probability estimation [pdf]

    On the convergence of Good Turing Estimators. D. McAllester, R. Schapire
    Concentration Bounds for Unigram Language Models. E. Drukh, Y. Mansour

  • Sep 22: Universal Compression and Competitive Distribution Estimation
  • Unit 2

    References:
    • A Survey on Distribution Testing, Clement Canonne [pdf]
    • A nice survey covering a lot of the testing related developments in recent years.
    • Testing random variables for independence and identity. T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White.
    • A coincidence-based test for uniformity given very sparsely-sampled discrete data. L. Paninski
    • Testing Shape Restrictions of Discrete Distributions. C. Canonne, I. Diakonikolas, T. Gouleakis, R. Rubinfeld
    • Optimal Testing for Properties of Distributions. J. Acharya C. Daskalakis, G. Kamath
    • A New Approach for Testing Properties of Discrete Distributions, I. Diakonikolas, D. Kane
    • A beautiful paper providing a unified approach to testing many properties, and a nice information-theoretic lower bound!
  • Sep 27, 29: Distribution Property Testing, Uniformity [pdf]