## Session 12

### Let's Do a Puzzle!

This puzzle is called Elections. It should make you happy.

We have been commissioned to write a program for the next version of electronic voting software for Black Hawk County. For each office we'll be given the number of candidates, c, and a text file containing a list of v votes, one per line, in the order they were cast. Each vote consists of the candidate's number on the ballot, from 1 to c.

For example, we might be told that c = 6 and be given this list of votes:

```
3 5 1 5 5
```

Candidate 5 has a majority, with 3 of the 5 votes, and is the winner.

Or, we might be told that c = 6 and be given this list of votes:

```
3 5 1 5 5 4
```

Now, no candidate has a majority, because Candidate 5's three votes are only 50% of the total, which is not more than half.

Write an algorithm to determine if there is a candidate with a majority of the votes. If so, output the candidate number, else output a 0.

What does your algorithm look like? What is its time complexity? Its space complexity?

### Debriefing Elections

Here is one simple brute-force solution:

```
INPUT : c, the number of candidates
        vote[1..v]
OUTPUT: majority candidate, or 0

for i ← 1 to c do
    counter[i] ← 0

for i ← 1 to v do
    increment counter[ vote[i] ]

for i ← 1 to c do
    if counter[i] > v/2
       return i
return 0
```

This algorithm creates an array of size c and so is O(c) in space. Its time complexity depends on the relative sizes of c and v, because it has loops over both. It is O(max(c,v)). It is likely that c << v, so we can estimate that the algorithm is O(v) in time.
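The counting algorithm above might look like this in Python. This is only a sketch: the function name `majority_by_counting` is made up for illustration, and it assumes the votes arrive as any iterable of integers in 1..c.

```python
def majority_by_counting(c, votes):
    """Return the majority candidate (1..c), or 0 if there is none."""
    counter = [0] * (c + 1)        # slot 0 is unused so candidates map to 1..c
    v = 0
    for vote in votes:             # one pass over the votes
        counter[vote] += 1
        v += 1
    for candidate in range(1, c + 1):
        if 2 * counter[candidate] > v:   # strict majority, no division needed
            return candidate
    return 0


print(majority_by_counting(6, [3, 5, 1, 5, 5]))      # 5
print(majority_by_counting(6, [3, 5, 1, 5, 5, 4]))   # 0
```

Comparing `2 * count > v` instead of `count > v/2` sidesteps any question about integer versus real division.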

Here is another straightforward brute-force solution:

```
for j ← 1 to c do
    count ← 0
    for i ← 1 to v do
        if vote[i] = j
           count ← count + 1
    if count > v/2
       return j
return 0
```

How does this algorithm compare to the first one? With the nested loops, it is O(c*v) in time. How about space? O(1) -- just one counter, regardless of how big c and v are!
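Here is a rough Python rendering of the one-counter idea. The name `majority_one_counter` is hypothetical, and for simplicity the sketch assumes the votes sit in a list; in the setting of the puzzle, each trip through the inner loop would be another pass over the vote file.

```python
def majority_one_counter(c, votes):
    """Return the majority candidate (1..c), or 0, using a single counter."""
    v = len(votes)
    for candidate in range(1, c + 1):    # c passes over the votes
        count = 0                        # the only counter we ever keep
        for vote in votes:
            if vote == candidate:
                count += 1
        if 2 * count > v:                # strict majority
            return candidate
    return 0


print(majority_one_counter(6, [3, 5, 1, 5, 5]))      # 5
print(majority_one_counter(6, [3, 5, 1, 5, 5, 4]))   # 0
```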

Notice that the second algorithm is much more efficient on space than the first, but much less efficient on time. What a wonderful example of the trade-off between space and time!

As v and c grow, the viability of these algorithms decreases. Either the time requirement grows quickly, as in Algorithm 2, or we need too much space. We might not even be able to do Algorithm 1 in memory! That would also increase its effective time spent, because it would have to work off disk.

Could v and c actually grow large enough to matter? Imagine solving a problem that involved pixels in a set of images, molecules in space, etc...

### Elections, Part 2

Suppose:

• The number of votes cast in a general election in China is 300,000,000.
• The candidates are numbered from 1 to 1,000,000.
• Our computer doesn't support arrays this big.

(It's harder to imagine the last one these days, but humor me.)

Now our brute-force counting of votes won't work, for lack of memory. So:

Write an algorithm to determine if there is a candidate with a majority, as before. Use as little space as possible. Doing I/O with such a long list of votes is expensive, too, so read through the input at most a few times.

Try to use a technique you have seen before: divide and conquer. That is, decompose the problem into smaller problems. Solve the smaller problems, and use their answers to produce a final answer. Take advantage of being able to make multiple passes over the input data.

### Breaking Up (The List) Is Hard To Do

Is this even possible? How? Here are a couple of ideas:

• Sort the stream of votes. Then do a final pass to find a majority vote getter.

The last pass is O(v) in time, but the presort is O(v*log v). For large v, this is expensive.

If space is at a premium, then this algorithm causes another problem. We won't be able to hold all the votes in memory, so we would be sorting on disk. This almost certainly signals use of a merge sort (more on that later), which is O(v) in space -- with high practical constants due to the disk access.

Presorting is often a useful technique when designing algorithms, when the alternatives are more expensive than the sort itself. That's not the case here, though. So, while promising, this idea doesn't help us.
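For comparison, the presorting idea can be sketched in a few lines of Python. This is an illustration only: `majority_by_sorting` is a made-up name, and the sketch sorts in memory with the built-in `sorted`, which is exactly what the puzzle says we cannot afford at scale.

```python
def majority_by_sorting(votes):
    """Return the majority candidate, or 0, by sorting and then scanning runs."""
    ordered = sorted(votes)              # O(v log v) presort, in memory
    v = len(ordered)
    run_candidate, run_length = None, 0
    for vote in ordered:                 # final O(v) pass over equal-valued runs
        if vote == run_candidate:
            run_length += 1
        else:
            run_candidate, run_length = vote, 1
        if 2 * run_length > v:           # this run is already a majority
            return run_candidate
    return 0


print(majority_by_sorting([3, 5, 1, 5, 5]))      # 5
print(majority_by_sorting([3, 5, 1, 5, 5, 4]))   # 0
```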

• Do a statistical sampling to select k candidates most likely to have a majority, or identify them based on domain knowledge. Then compute vote totals for each of the k. If none has a majority, fall back to another algorithm. You can't return failure, because one of the other c-k candidates may have won!

In the best case, this algorithm is O(v) in time, with two passes, and O(k) in space. On average, this algorithm can perform just as well, too, if our way of selecting k is reliable. But in the worst case, the algorithm is much worse: It does all this work, only to have to start over.

... or do we have to start over? Maybe we could look at the next most likely k candidates? This idea is the seed of a surprisingly good algorithm.

Here's the algorithm. It uses a top-down approach:

1. Break the candidates down into k << c sub-ranges. For each sub-range, count the total votes for all of its candidates. If some sub-range k_i has a majority, then do Step 2. Otherwise, return 0.

2. Make a second pass through the input, counting the votes for each candidate in the range k_i. If some candidate k_ij has a majority, then output k_ij, otherwise return 0.

What is the time complexity of this algorithm? O(v) in time, with two passes. What about its space complexity? We need to keep track of vote counts for k ranges in pass one and for c/k candidates in pass two. So the algorithm is O( max(k, c/k) ) in space.

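What is the best value for k? The first pass needs k counters and the second needs c/k, so we want to minimize the larger of the two. The sizes balance when k = c/k, so we choose k = sqrt(c)! For c = 1,000,000, that means only 1,000 counters in each pass.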
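The two-pass range decomposition might be sketched in Python as follows. This is not the Java program from the .zip file, just an illustration under some assumptions: the name `majority_by_ranges` is invented, c is at least 1, and `read_votes` is a hypothetical zero-argument function that returns a fresh iterator over the votes each time it is called (for a file, something that reopens it and yields one integer per line).

```python
import math


def majority_by_ranges(c, read_votes):
    """Return the majority candidate (1..c), or 0, in two passes over the votes."""
    k = max(1, math.isqrt(c))        # number of sub-ranges, about sqrt(c)
    width = -(-c // k)               # ceil(c / k): candidates per sub-range

    # Pass 1: count the votes that fall in each sub-range.
    range_count = [0] * k
    v = 0
    for vote in read_votes():
        range_count[(vote - 1) // width] += 1
        v += 1

    # At most one sub-range can hold more than half of the votes.
    winner = next((r for r in range(k) if 2 * range_count[r] > v), None)
    if winner is None:
        return 0

    # Pass 2: count votes per candidate, but only inside the winning sub-range.
    low = winner * width + 1         # first candidate number in that sub-range
    candidate_count = [0] * width
    for vote in read_votes():
        if low <= vote < low + width:
            candidate_count[vote - low] += 1

    for offset, count in enumerate(candidate_count):
        if 2 * count > v:
            return low + offset
    return 0


print(majority_by_ranges(6, lambda: iter([3, 5, 1, 5, 5])))      # 5
print(majority_by_ranges(6, lambda: iter([3, 5, 1, 5, 5, 4])))   # 0
```

Passing a function rather than a list keeps the sketch honest about the I/O: the votes are never stored, only streamed twice.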

Today's .zip file contains a Java program that finds a majority using the range decomposition algorithm, along with a program that finds a majority using the first brute-force algorithm above. I even threw in a few test files, a couple of which are large-ish. Check them out...

### Divide and Conquer

Over the next few weeks of the course, we will discuss variations of top-down decomposition. As we have discussed now for several weeks, a top-down algorithm breaks a problem into one or more smaller problems on demand, solves each, and then generates an answer from these solutions.

The smaller problems can be:

• instances reduced by a constant value.   For example, in Session 3, we learned that, when counting the ways in which a game can reach its final score, the value games( p1:p2 ) can be computed from the values games( (p1-1):p2 ) and games( p1:(p2-1) ).

• instances reduced by a constant factor.   For example, in the mergesort algorithm, we sort an array v[1..n] by first sorting v[1..n/2] and v[n/2+1..n].

Some people call both of these kinds of algorithm "divide and conquer". We will call the latter divide-and-conquer and the former decrease-and-conquer. We will consider each for a couple of sessions.

How can we tell these apart? Divide-and-conquer decomposes a problem into sub-problems of roughly the same size. An algorithm that peels off only one value, say, by splitting v[1..n] into v[1] and v[2..n], is decrease-and-conquer. The sizes of the sub-problems usually play an important role in determining the efficiency of the resulting algorithms.
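To make the distinction concrete, here are two small sorting sketches in Python. Mergesort is the divide-and-conquer example named above. The recursive insertion sort is my own choice of a decrease-and-conquer counterpart, not something taken from earlier sessions, and its deep recursion makes it suitable only for short lists.

```python
def merge_sort(values):
    """Divide-and-conquer: two sub-problems, each about half the size."""
    if len(values) <= 1:
        return list(values)
    middle = len(values) // 2
    left = merge_sort(values[:middle])
    right = merge_sort(values[middle:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def insertion_sort(values):
    """Decrease-and-conquer: peel off one value, sort the rest, insert it back."""
    if len(values) <= 1:
        return list(values)
    first, rest = values[0], insertion_sort(values[1:])
    position = 0
    while position < len(rest) and rest[position] < first:
        position += 1
    return rest[:position] + [first] + rest[position:]


print(merge_sort([31, 3, 39, 14, 27]))       # [3, 14, 27, 31, 39]
print(insertion_sort([31, 3, 39, 14, 27]))   # [3, 14, 27, 31, 39]
```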

### Binary Search

The prototypical divide-and-conquer algorithm is binary search, which finds a value in a sorted list.

```
INPUT : k, a target value
        v[1..n], a sorted list of values
OUTPUT: i, the index of k in v, or failure

if n < 1
   then fail

middle := (n+1)/2                 # integer division, so 1 ≤ middle ≤ n
if k = v[middle]
   then return middle
if k < v[middle]
   then return search(k, v[1..middle-1])
   else return middle + search(k, v[middle+1..n])
```

We could, of course, let middle be any value such that 1 ≤ middle ≤ n, but we get our best performance with middle in the very middle: We eliminate half of the list on each pass.
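A recursive binary search might be written in Python like this. It is a sketch with a few assumptions of my own: indices are 0-based, the function passes `low` and `high` bounds rather than renumbered sub-lists (so no copying and no index adjustment is needed), and failure is reported as `None`.

```python
def binary_search(k, v, low=0, high=None):
    """Return the 0-based index of k in the sorted list v, or None."""
    if high is None:
        high = len(v) - 1
    if low > high:                       # empty range: k is not present
        return None
    middle = (low + high) // 2
    if k == v[middle]:
        return middle
    if k < v[middle]:
        return binary_search(k, v, low, middle - 1)    # keep the left half
    return binary_search(k, v, middle + 1, high)       # keep the right half


values = [3, 14, 27, 31, 39]
print(binary_search(27, values))   # 2
print(binary_search(39, values))   # 4
print(binary_search(20, values))   # None
```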

Quick Exercise. Given this array:
```
3 14 27 31 39
```

Which target values require the largest number of comparisons for a successful search? What is the average number of comparisons for a successful search?

[ ... fill in an answer! ... ]. Note the integer arithmetic, and that on each call the array is renumbered 1..size.

If we want to ask the same questions about unsuccessful searches, what more do we need to know, or assume?

Note that even though binary search is top-down and has a natural recursive definition, we can easily implement the algorithm in an iterative fashion:

```
INPUT : k, a target value
        v[1..n], a sorted list of values
OUTPUT: i, the index of k in v, or failure

left  := 1
right := n
while left ≤ right
    middle := (left + right)/2      # notice difference!
    if k = v[middle]
       then return middle
    if k < v[middle]
       then right := middle - 1
       else left  := middle + 1
fail
```
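The iterative version translates almost line for line into Python. As before, this is a sketch that assumes 0-based indices and uses `None` to signal failure; the name `binary_search_iterative` is made up.

```python
def binary_search_iterative(k, v):
    """Return the 0-based index of k in the sorted list v, or None."""
    left, right = 0, len(v) - 1
    while left <= right:
        middle = (left + right) // 2
        if k == v[middle]:
            return middle
        if k < v[middle]:
            right = middle - 1       # discard the right half
        else:
            left = middle + 1        # discard the left half
    return None


values = [3, 14, 27, 31, 39]
print(binary_search_iterative(31, values))   # 3
print(binary_search_iterative(5, values))    # None
```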

Can we do binary search in an unsorted list? ... Certainly, but we cannot eliminate part of the list on each pass. A recursive implementation might have to work to the bottom of both subtrees and then filter the answer back to the top. The iterative solution doesn't make sense any more in this scenario.

Aside. ... except for breaking out of the recursive computation as soon as an answer is found. call/cc in Scheme!

What would the time complexity of such a search be? Space complexity?
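To experiment with these questions, here is one way the recursive search of an unsorted list could look in Python. It is only a sketch: `search_unsorted` is an invented name, indices are 0-based, and `None` means failure. Because the list is not sorted, a comparison with the middle element tells us nothing about which half to discard, so the search may have to visit both.

```python
def search_unsorted(k, v, low=0, high=None):
    """Return a 0-based index of k in the unsorted list v, or None."""
    if high is None:
        high = len(v) - 1
    if low > high:
        return None
    middle = (low + high) // 2
    if v[middle] == k:
        return middle
    found = search_unsorted(k, v, low, middle - 1)     # try the left half
    if found is not None:
        return found                                   # stop as soon as we have an answer
    return search_unsorted(k, v, middle + 1, high)     # otherwise try the right half


values = [31, 3, 39, 14, 27]
print(search_unsorted(14, values))    # 3
print(search_unsorted(100, values))   # None
```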

Even when we want an iterative solution, starting with a recursive idea is often the best way to proceed. We learned something similar in our first couple of sessions: Even when we want a bottom-up or zoom-in solution, starting with a top-down idea may be the best way to proceed.

This is true nowhere more so than when dealing with data that is represented in tree form, especially binary trees. As with lists, processing sorted trees sometimes lets us search more efficiently, whereas processing unsorted data usually requires an O(n) approach, whether we are searching or computing.

### Wrap Up

Eugene Wallingford ..... wallingf@cs.uni.edu ..... February 25, 2014