Algorithmic & Information Theoretic Methods in Data Science (ECE6980)

Course home

Homework problems

We will have a stream of homework problems, following every class. Since this is an advanced graduate-level class, solving these problems soon after each class will (hopefully) help you understand the material better.

23rd August

Problem 1 (Total Variation (TV) Distance).    Let $p$, and $q$ be distributions over a set $\mathcal{X}$. Their total variation distance is defined as \[ d_{\rm TV}(p,q) = \sup_{A\subseteq\mathcal{X}}\left( p(A)-q(A)\right), \] where $p(A) = \sum_{x\in A} p(x)$ (the sum is replaced by an integral when $\mathcal{X}$ is continuous).

(a) Let $B = \{x\in\mathcal{X}:p(x)>q(x)\}$. Show that $ d_{\rm TV}(p,q) = p(B) - q(B)$.

(b) Let $\ell_1(p, q) = \sum_{x\in\mathcal{X}}|p(x)-q(x)|$. Show that $ d_{\rm TV}(p,q) = \frac{1}{2} \cdot \ell_1(p, q)$.
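
For intuition (not part of the assignment), here is a small numerical check of (a) and (b) on an arbitrary toy pair of distributions:

```python
import numpy as np

# Two arbitrary toy distributions over a 4-symbol alphabet.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# (a): sum the excess p(x) - q(x) over B = {x : p(x) > q(x)}.
B = p > q
tv_via_B = np.sum(p[B] - q[B])

# (b): half of the l1 distance.
tv_via_l1 = 0.5 * np.sum(np.abs(p - q))

print(tv_via_B, tv_via_l1)   # both equal 0.25 here
```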

Problem 2 (An Operational Meaning for TV).    You know $p$ and $q$. Consider the following process: a fair coin toss selects one of $p$ and $q$, and a single sample $X$ is drawn from the selected distribution. Upon observing $X=x$, you declare "$p$" with probability $c_x$ and "$q$" with probability $1-c_x$, for some decision rule $c\in[0,1]^{\mathcal{X}}$. Let $P_e(c)$ denote the probability that your declaration is wrong.

(a) Show that \[ P_e(c) = \frac{1}{2} +\frac{1}{2}\sum_{x\in\mathcal{X}}c_x\cdot (q(x)-p(x)). \]

(b) Show that the error is minimized when $c_x=1$ for $x\in B$, and $c_x=0$ for $x\notin B$ ($B$ was defined in the previous problem). Using this show that the minimum error possible is \[ P_e^* = \frac 12 - \frac 12 \cdot d_{\rm TV}(p,q). \]
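
Not part of the assignment: a small simulation of this process using the rule from part (b); the empirical error should concentrate around $\frac12 - \frac12\cdot d_{\rm TV}(p,q)$. The toy distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
B = p > q                                    # optimal decision region from Problem 1

trials, errors = 200_000, 0
for _ in range(trials):
    truth_is_p = rng.random() < 0.5          # fair coin picks the source
    x = rng.choice(len(p), p=p if truth_is_p else q)
    declare_p = bool(B[x])                   # declare "p" iff x lies in B
    errors += (declare_p != truth_is_p)

tv = 0.5 * np.sum(np.abs(p - q))
print(errors / trials, 0.5 - 0.5 * tv)       # both close to 0.375 here
```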

28th August

Problem 3 (Binomial Variance Computations).    If you are comfortable with this question, please skip it.

(a) Show that if $X,Y$ are independent random variables, ${\rm Var}(X+Y) = {\rm Var}(X)+{\rm Var}(Y)$, where ${\rm Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] = \mathbb{E}[X^2]-\mathbb{E}[X]^2$.

Suppose $X_1,\ldots, X_n$ are independent $\text{Bern}(p)$ random variables. Then $X = X_1+\ldots+X_n$ is a $\text{Bin}(n,p)$ random variable, namely ${\rm Pr}(X=k) = {n\choose k} p^k(1-p)^{n-k}$.

(b) Show that $\mathbb{E}[X] = np$, and ${\rm Var}(X) = n\cdot p(1-p)$.
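
If you would like a quick numerical sanity check (not part of the assignment), the following simulation compares the empirical mean and variance of a binomial sample with the closed forms:

```python
import numpy as np

n, p = 50, 0.3
rng = np.random.default_rng(1)
samples = rng.binomial(n, p, size=500_000)

print(samples.mean(), n * p)               # empirical mean vs. np = 15
print(samples.var(), n * p * (1 - p))      # empirical variance vs. np(1-p) = 10.5
```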

Problem 4 (An Instance-by-Instance Learning Bound). In class we showed that the empirical distribution $\widehat p_n$ satisfies \[ \mathbb{E}[\ell_1(p,\widehat {p}_n)]\le \frac{\sqrt{k}}{\sqrt n}. \] However, for some distributions we do much better (e.g., a distribution supported on a single element). Let $\|p\|_r = \left(\sum_{x} {p(x)^r}\right)^{1/r}$. In particular, $ \|p\|_{1/2}^{1/2} = \sum_{x} {p(x)^{1/2}}$.

(a) Prove that \[ \mathbb{E}[\ell_1(p,\widehat {p}_n)]\le \frac{\|p\|_{1/2}^{1/2}}{\sqrt n}. \] It might be helpful to go back to the step where we applied Cauchy-Schwarz, and tweak things around.

(b) Show that \[ \|p\|_{1/2}^{1/2} = \sum_x p(x)^{1/2} \le \sqrt k, \] and that equality holds for the uniform distribution. So, in some sense, the uniform distribution is the hardest to learn.
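
Not part of the assignment: a small simulation contrasting the generic bound $\sqrt{k}/\sqrt{n}$ with the instance-wise bound $\|p\|_{1/2}^{1/2}/\sqrt{n}$ on a heavily skewed distribution (the choices of $k$, $n$, and $p$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 1000, 2000
# A heavily skewed distribution: most of the mass on a single symbol.
p = np.full(k, 0.1 / (k - 1))
p[0] = 0.9

half_norm = np.sum(np.sqrt(p))             # ||p||_{1/2}^{1/2}

errs = []
for _ in range(200):
    p_hat = rng.multinomial(n, p) / n      # empirical distribution from n samples
    errs.append(np.sum(np.abs(p - p_hat)))

print(np.mean(errs))                                # empirical E[l1(p, p_hat)]
print(np.sqrt(k / n), half_norm / np.sqrt(n))       # generic vs. instance-wise bound
```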

30th August

Problem 5 (Poisson Moment Computations).   

(a) Let $X\sim {\rm Poi}(\lambda)$, and $r\in\mathbb{N}$. Show that \[ \mathbb{E}\left[X(X-1)\cdots(X-r+1)\right] = \lambda^r. \]

Using the above, show the following.

(b) For $X\sim {\rm Poi}(\lambda)$, ${\rm Var}(X) = \lambda$.

(c) Let $\mu\in\mathbb{R}$, and $X\sim {\rm Poi}(\lambda)$, then \[ \mathbb{E}\left[(X-\mu)^2-X\right] = (\lambda-\mu)^2, \] and \[ {\rm Var}\left((X-\mu)^2-X\right) = 2\lambda^2+4\lambda(\lambda-\mu)^2. \]
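
Not part of the assignment: a Monte Carlo check of the factorial-moment identity in (a) and the variance in (b), with an arbitrary $\lambda$ and $r$:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, r = 4.0, 3
x = rng.poisson(lam, size=1_000_000).astype(float)

# Falling factorial X(X-1)...(X-r+1), built up term by term.
falling = np.ones_like(x)
for i in range(r):
    falling *= x - i

print(falling.mean(), lam ** r)    # (a): ~64 vs. 64
print(x.var(), lam)                # (b): ~4 vs. 4
```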

Problem 6 (Majority for Boosting Error Probability).    In class we saw Markov's Inequality and Chebyshev's Inequality. Please browse through the concentration inequalities page on Wikipedia. We will use the following inequality, which you can assume to be true.

Suppose $X_1, \ldots, X_m$ are independent random variables such that $X_i\in\{0,1\}$. Let $\mu = \mathbb{E}[\sum_{i=1}^{m} X_i]$. Then for $0\le \gamma\le 1$, \[ {\rm Pr}\left(\left|(\sum_i X_i) - \mu\right|>\gamma\mu\right) \le2e^{-\gamma^2\mu/3}. \]

Suppose the $X_i$ are all ${\rm Bern}(0.9)$ random variables. Show that \[ {\rm Pr}\left(\sum_i X_i <0.5m\right) \le2e^{-m/12}. \]

Using this, the goal is to show that if we can solve a testing problem with error probability at most 0.1 using $n$ independent samples, then we can solve the same testing problem using $O(n\log \frac{1}{\delta})$ samples and error probability at most $\delta$.
Consider the following algorithm: run the given $n$-sample tester independently $24\log(1/\delta)$ times, each time on fresh samples, and output the hypothesis returned by a majority of the runs.

Prove that this algorithm takes $24n\log(1/\delta)$ samples and outputs the correct hypothesis with error probability at most $\delta$.
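
Not part of the assignment: a minimal sketch of the amplification wrapper, assuming hypothetical interfaces `draw_samples(n)` (returns $n$ fresh i.i.d. samples) and `base_test(samples)` (returns a hypothesis label and errs with probability at most 0.1); `boosted_test` is likewise just an illustrative name.

```python
import math
from collections import Counter

def boosted_test(draw_samples, base_test, n, delta):
    """Majority-vote amplification of a constant-error tester.

    draw_samples(n): returns n fresh i.i.d. samples (hypothetical interface).
    base_test(samples): returns a hypothesis label, wrong w.p. at most 0.1.
    """
    m = math.ceil(24 * math.log(1 / delta))            # number of independent runs
    votes = Counter(base_test(draw_samples(n)) for _ in range(m))
    return votes.most_common(1)[0][0]                  # majority hypothesis
```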

Problem 7 (KL Divergence).    We considered the total variation distance between distributions. Another useful distance measure is the KL divergence, whose relation to data compression we will see next week. It is defined as \[ KL(p,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}. \]

(a) Show that $KL(p,q)\ge0$.

(b) This is the log-sum inequality. For positive reals $a_1,\ldots,a_m$ and $b_1, \ldots, b_m$, prove that \[ \sum_{i=1}^{m} a_i\log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{m} a_i\right)\log \frac{\sum_{i=1}^{m} a_i}{\sum_{i=1}^{m} b_i}. \]

(c) Data processing. Let $f:\mathcal{X}\to\mathcal{Y}$ be a randomized function that maps an $x$ in $\mathcal{X}$ to $y$ in $\mathcal{Y}$ with probability $r(y|x)$. Let $p_f$ and $q_f$ be the distributions induced over $\mathcal{Y}$ via $f$ by distributions $p$ and $q$ over $\mathcal{X}$. Show that \[ KL(p,q)\ge KL(p_f, q_f). \]

(d) Convexity of Divergence. Let $p_1, p_2, q_1, q_2$ be distributions over $\mathcal{X}$, and $0\le\lambda\le1$. Let $p = \lambda p_1+ (1-\lambda)p_2$, and $q =\lambda q_1+ (1-\lambda)q_2$. Show that \[ KL(p,q) \le \lambda \cdot KL(p_1, q_1) +(1-\lambda)\cdot KL(p_2,q_2). \]

(e) This is Pinsker's inequality. Prove that \[ KL(p,q) \ge 2\cdot d_{\rm TV}(p,q)^2. \]
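
Not part of the assignment: a quick numerical sanity check of the definition and of Pinsker's inequality from part (e), on the same arbitrary toy pair as above (natural logarithms throughout).

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl = np.sum(p * np.log(p / q))       # KL(p, q) in nats
tv = 0.5 * np.sum(np.abs(p - q))

print(kl, 2 * tv ** 2)               # Pinsker: 0.166 >= 0.125
```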

***** This concludes the problems for HW1. Please solve all problems. Submit solutions to 2, 4, 6, and 7 on 18th September in class. *****

Problem 8 (Information Theory Reading).    Please read Chapter 2 of Elements of Information Theory, by Thomas Cover and Joy Thomas.

(a) Using the last problem, show that $H(p)$ is a concave function of $p$.

(b) Design a distribution over $\mathbb{N}$ with infinite entropy.

Problem 9 (Hypothesis Testing TV Bounds).    Suppose $p$ and $q$ are two distributions with $d_{\rm TV}(p,q)=\varepsilon$. Suppose we want to test between $p$ and $q$ with success probability at least 0.9.

(a) Show that the number of samples needed for this testing task is $\Omega(1/\varepsilon)$ and $O(1/\varepsilon^2)$.

(b) Design two examples of $(p,q)$ that match the two bounds.

Problem 10 ($\chi$-squared distance).    Recall from class that \[ \chi^2(p,q) = \sum_{x\in\mathcal{X}} \frac{(p(x)-q(x))^2}{q(x)}. \]

(a) Show that $KL(p,q)\le \chi^2(p,q)$.
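
As a quick sanity check (not part of the assignment), the inequality in (a) on the same arbitrary toy pair, with natural logarithms:

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

chi2 = np.sum((p - q) ** 2 / q)
kl = np.sum(p * np.log(p / q))       # in nats

print(kl, chi2)                      # KL <= chi-squared: 0.166 <= 0.36
```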

Problem 11 (Learning Gaussians).    For Gaussian distributions $N_{\mu_1, \sigma_1^2}$ and $N_{\mu_2, \sigma_2^2}$,

(a) Show that \[ KL(N_{\mu_1, \sigma_1^2}, N_{\mu_2, \sigma_2^2}) = \log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2 \sigma_2^2}-\frac12. \]

(b) Show that the KL divergence and TV distance between two $d$-dimensional Gaussian distributions with identity covariance matrices are functions of only the distance between their means.

(c) Show that we can learn a $d$-dimensional Gaussian distribution with identity covariance matrix to within KL divergence $\varepsilon^2$, and TV distance $\varepsilon$, using $O(d/\varepsilon^2)$ samples.

(d)* Show that any algorithm that can learn $d$-dimensional Gaussian distributions with identity covariance matrix requires $\Omega(d/\varepsilon^2)$ samples.
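
Not part of the assignment: a numerical check of the closed-form expression in part (a) against a direct Riemann-sum approximation of the KL integral; the parameter values are arbitrary.

```python
import numpy as np

mu1, s1, mu2, s2 = 0.0, 1.0, 1.5, 2.0    # arbitrary parameter values

# Closed form from part (a), in nats.
kl_formula = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Riemann-sum approximation of the integral of p(x) log(p(x)/q(x)).
x = np.linspace(-20, 20, 400_001)
p = np.exp(-(x - mu1)**2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu2)**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(kl_formula, kl_numeric)            # both ~0.599
```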

Problem 12 (Variance of empirical entropy estimation).   

(a) Please go through these slides for McDiarmid's inequality.

(b) Consider the set-up to be the same as that of McDiarmid's inequality. Then, \[ {\rm Var}(f(X_1,\ldots, X_n))\le \frac14 \cdot nc^2. \] We will not prove this result. However, using it, show that \[ {\rm Var}\left(H(\widehat{p}_n)\right) \le 25\cdot\frac{\log^2 n}{n}. \]
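
Not part of the assignment: a small simulation comparing the empirical variance of the plug-in entropy estimate with the bound above, for an arbitrary choice of $p$, $k$, and $n$.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 100, 5000
p = np.full(k, 1.0 / k)                      # arbitrary test distribution (uniform)

def plugin_entropy(counts, n):
    q = counts[counts > 0] / n
    return -np.sum(q * np.log(q))            # plug-in (empirical) entropy, in nats

vals = [plugin_entropy(rng.multinomial(n, p), n) for _ in range(2000)]
print(np.var(vals), 25 * np.log(n) ** 2 / n) # empirical variance vs. the bound
```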

Problem 13 (Bias Variance Decomposition).   

For an estimator $\hat{H}(X^n)$ of entropy, let $Bias(\hat{H}) = H(p)-\mathbb{E}[\hat{H}(X^n)]$. Show that \[ \mathbb{E}[(\hat{H}(X^n) -H(p))^2] =Bias(\hat{H})^2+{\rm Var}(\hat{H}(X^n)). \] This is a special case of bias-variance decomposition of mean squared error.

Problem 14 (Lower bounds for integer $\alpha$).    In class we mentioned that the sample complexity of estimating $H_\alpha(p)$ is $\Theta(k^{1-1/\alpha}/\varepsilon^2)$. This assignment provides some arguments toward a lower bound.

Le Cam's two-point method is the most common technique for proving lower bounds on the sample complexity of testing and property estimation problems. The steps are as follows: (i) Design two collections of distributions $\mathcal{P}$ and $\mathcal{Q}$ such that $|f(p)-f(q)|>5\varepsilon$ for any $p\in\mathcal{P}$ and $q\in\mathcal{Q}$. (ii) Then the sample complexity of estimating $f$ is at least the sample complexity of distinguishing $\mathcal{P}$ from $\mathcal{Q}$.

(a) For $\alpha=2$, using the constructions of uniformity testing, show a lower bound of $\Omega(\sqrt{k}/\varepsilon)$ samples.

Recall that the lower bound involving Paninski's construction was tedious because of the prior over many distributions. We will now consider lower bounds for estimating $H_\alpha(p)$ that are much simpler for integral $\alpha$. In fact, they are achieved when $\mathcal{P}$ and $\mathcal{Q}$ each contain a single distribution.

(b) Suppose $\mathcal{P} = \{u\}$, where $u$ is the uniform distribution over $[k+1]$, and $\mathcal{Q} = \{q\}$, where \[ q(1) = \frac{\varepsilon}{k^{1-1/\alpha}}, \quad q(j) = \frac{1-q(1)}{k}, \text{ for $j=2,\ldots, k+1$.} \] What is $H_\alpha(u)-H_\alpha(q)$? Obtain a lower bound on the sample complexity of estimating $H_\alpha$ by showing a lower bound on the complexity of distinguishing $q$ and $u$.

(c) Suppose $\mathcal{P} = \{p\}$ and $\mathcal{Q} = \{q\}$, where $p$ and $q$ are distributions on $[k+1]$ such that \[ p(1) = \frac{1}{k^{1-1/\alpha}}, \quad p(j) = \frac{1-p(1)}{k}, \text{ for $j=2,\ldots,k+1$,} \] and \[ q(1) = \frac{1+5\varepsilon}{k^{1-1/\alpha}}, \quad q(j) = \frac{1-q(1)}{k}, \text{ for $j=2,\ldots,k+1$.} \] What is $H_\alpha(p)-H_\alpha(q)$? Obtain a lower bound on the sample complexity of estimating $H_\alpha$ by showing a lower bound on the complexity of distinguishing $p$ and $q$.