# Greedy Algorithm

## Interval Scheduling

• Job j starts at $s_j$ and finishes at $f_j$.
• Two jobs compatible if they don't overlap.
• Goal: find maximum subset of mutually compatible jobs.

Greedy algorithm.

Consider jobs in increasing order of finish time.Take each job provided it's compatible with the ones already taken.

$$\text{Sort jobs by finish times so that }f_1 < f_2 < ... < f_n.\A \leftarrow \emptyset\for, j = 1\text{to n} { \qquad if (job j compatible with A)\qquad A \leftarrow A \cup {j}}\return A$$

Implementation. $O(n log n)$.

• Remember job j* that was added last to A.
• Job j is compatible with A if $s_j < f_{j^*}$.

Analysis:

Theorem. Greedy algorithm is optimal.

Assume greedy is not optimal, and let's see what happens.

Let $i_1, i_2, ... i_k$ denote set of jobs selected by greedy.

Let $j_1, j_2, ... j_m$ denote set of jobs in the optimal solution with $i_1 = j_1, i_2 = j_2, ..., i_r = j_r$ for the largest possible value of r.

## Interval Partitioning

• Lecture j starts at $s_j$ and finishes at $f_j$.
• Goal: find minimum number of classrooms to schedule all lectures so that no two occur at the same time in the same room.

Def. The depth of a set of open intervals is the maximum number that contain any given time.

Greedy algorithm

Consider lectures in increasing order of start time: assign lecture to any compatible classroom

$$\text{Sort intervals by starting time so that} s1 < s2 < ... < sn.\d \leftarrow 0\for, j = 1, to, n, {\qquad \text{if (lecture j is compatible with some classroom k)}\qquad \qquad \text{schedule lecture j in classroom k}\qquad else\qquad \qquad \text{allocate a new classroom d + 1}\qquad \qquad \text{schedule lecture j in classroom d + 1}\qquad \qquad d \leftarrow d + 1}$$

Implementation. $O(n log n)$.

• For each classroom k, maintain the finish time of the last job added.
• Keep the classrooms in a priority queue.

Analysis:

Greedy algorithm never schedules two incompatible lectures in the same classroom.

**Theorem. Greedy algorithm is optimal.**Pf.

• Let d = number of classrooms that the greedy algorithm allocates.
• Classroom d is opened because we needed to schedule a job, say j, that is incompatible with all d-1 other classrooms.
• These d jobs each end after $s_j$.
• Since we sorted by start time, all these incompatibilities are caused by lectures that start no later than $s_j$.
• Thus, we have d lectures overlapping at time $s_j$.
• Key observation => all schedules use >= d classrooms. ▪

## Scheduling to Minimize Lateness

• Single resource processes one job at a time.
• Job j requires $t_j$ units of processing time and is due at time $d_j$.
• If j starts at time $s_j$, it finishes at time $f_j= s_j + t_j$.
• Lateness: $l_j = max{0, fj - dj}$.
• Goal: schedule all jobs to minimize maximum lateness $L = max, l_j$.

Greedy Algorithms：Earliest deadline first. 仍然是按照 DDL 先后排序

$$\text{Sort n jobs by deadline so that} d_1 < d2 < … < d_n\t \leftarrow 0\for, j = 1, to, n\qquad \text{Assign job j to interval} [t, t + t_j]\qquad s_j \leftarrow t, f_j \leftarrow t + t_j\qquad t \leftarrow t + t_j\text{output intervals} [s_j, f_j]$$

Observation. There exists an optimal schedule with no idle time.

Observation. The greedy schedule has no idle time.

Def.

Inversion: Given a schedule S, an inversion is a pair of jobs i and j such that: i < j but j scheduled before i.

Observation. Greedy schedule has no inversions.Observation. If a schedule (with no idle time) has an inversion, it has one with a pair of inverted jobs scheduled consecutively.

Claim. Swapping two consecutive, inverted jobs reduces the number of inversions by one and does not increase the max lateness

Pf. Let $l$ be the lateness before the swap, and let $l'$ be it afterwards.

• $l'_k = l_k$ for all $k \neq i, j$
• $l'_i \leq l_i$
• If job j is late:

## Optimal Caching

Caching.

• Cache with capacity to store k items.
• Sequence of m item requests $d_1, d_2, …, d_m$.
• Cache hit: item already in cache when requested.
• Cache miss: item not already in cache when requested: must bring requested item into cache, and evict some existing item, if full.

Goal. Eviction schedule that minimizes number of cache misses.

Farthest-in-future: Evict item in the cache that is not requested until farthest in the future

Theorem. [Bellady, 1960s] FF is optimal eviction schedule.Pf. Algorithm and theorem are intuitive; proof is subtle.

Def. A reduced schedule is a schedule that only inserts an item intothe cache in a step in which that item is requested.

Intuition. Can transform an unreduced schedule into a reduced onewith no more cache misses.

Claim. Given any unreduced schedule S, can transform it into a reducedschedule S' with no more cache misses.

Pf. (by induction on number of unreduced items)

Suppose S brings d into the cache at time t, without a request.

Let c be the item S evicts when i t brings d into the cache.

• Case 1: d evicted at time t', before next request for d.
• Case 2: d requested at time t' before d is evicted. ▪

Theorem. FF is optimal eviction algorithm

Pf: // TODO: p30 greedy

## Minimum Spanning Tree

Given a connected graph G = (V, E) with real-valued edge weights $c_e$, an MST is a subset of the edges T  E such that T is a spanning tree whose sum of edge weights is minimized.

### Kruskal's algorithm.

Start with $T = \emptyset$. Consider edges in ascending order of cost.

Insert edge e in T unless doing so would create a cycle.

Implementation. Use the union-find data structure.

• Build set T of edges in the MST.
• Maintain set for each connected component.
• O(m log n) for sorting and O(m  (m, n)) for union-find.

### Reverse-Delete algorithm.

Start with T = E. Consider edges in descending order of cost.

Delete edge e from T unless doing so would disconnect T.

### Prim's algorithm

Start with some root node s and greedily grow a tree T from s outward.

At each step, add the cheapest edge e to T that has exactly one endpoint in T.

Implementation. Use a priority queue ala Dijkstra.

• Maintain set of explored nodes S.
• For each unexplored node v, maintain attachment cost a[v] = cost of cheapest edge v to a node in S.
• O(n2) with an array; O(m log n) with a binary heap.

Def:

Cut property:

Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then the MST contains e.

Cycle property:

Let C be any cycle, and let f be the max cost edge belonging to C. Then the MST does not contain f.

Cycle

Set of edges the form a-b, b-c, c-d, …, y-z, z-a.

Cutset

A cut S is a subset of nodes. The corresponding cutset D is the subset of edges with exactly one endpoint in S

Claim. A cycle and a cutset intersect in an even number of edges.

### 优化

Kruskal 和 Prim 都只能计算权重不同的情况，若存在权重相同的边，则在比较阶段优化即可：

## Clustering

k-clustering: Divide objects into k non-empty groups.

Distance function: Assume it satisfies several natural properties.

• $d(p_i, p_j) = 0$ iff $p_i = p_j$ (identity of indiscernibles)
• $d(p_i, p_j) \geq 0$ (nonnegativity)
• $d(p_i, p_j) = d(p_j, p_i)$ (symmetry)

Spacing: Min distance between any pair of points in different clusters.

Clustering of maximum spacing: Given an integer k, find a k-clustering of maximum spacing.

Form a graph on the vertex set U, corresponding to n clusters.

Find the closest pair of objects such that each object is in a different cluster, and add an edge between them to merge these two clusters into a new cluster.

Repeat n-k times until there are exactly k clusters.

Key observation. This procedure is precisely Kruskal's algorithm (except we stop when there are k connected components).

Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.

## Huffman Codes

T(n) = T(n-1) + O(n)

so time complexity is $O(N^2)$

Using priority queue: $T(n) = T(n-1) + O(logn) => O(NlogN)$

Claim. Huffman code for S achieves the minimum ABL of any prefix code.

**Pf. (by induction)**Base: For n=2 there is no shorter code than root and two leaves.Hypothesis: Suppose Huffman tree T’ for S’ with $\omega$ instead of y and z is optimal. (IH)Step: (by contradiction)

• Suppose Huffman tree T for S is not optimal.
• So there is some tree Z such that ABL(Z) < ABL(T).
• Then there is also a tree Z for which leaves y and z exist that aresiblings and have the lowest frequency (see observation).
• Let Z’ be Z with y and z deleted, and their former parent labeled $\omega$.
• Similar T’ is derived from S’ in our algorithm.
• We know that $ABL(Z’)=ABL(Z)-f_\omega$, as well as $ABL(T’)=ABL(T)-f_\omega$.
• But also ABL(Z) < ABL(T), so ABL(Z’) < ABL(T’).