DATA STRUCTURES AND ALGORITHMS CS 3139
Lecture: February 25th, 1999
ANNOUNCEMENTS
- HW #1 was returned. Can graded the non-programming portion, and Nhat graded
the programming portions.
- HW #1 solutions available -- we incorporated quality student solutions into
the official solutions.
REVIEW
Table ADT
- Consists of key/value pairs
- Many ways to organize
- One way is to use a fixed-size array ==> hash table
Hash Table Issues
- Trade memory for running time
- f(key) ==> array index ==> lookup
- Hash table size
- Hash function
- Collisions: separate chaining -- open addressing -- rehashing O(N)
TODAY
Priority queues
Performance of collision schemes
Extensible hashing
PRIORITY QUEUES
Recall queues: first in, first out scheme
Suppose some items are more important than others, e.g., as given by a priority() function
Priority queues
- insert() elements, and remove the element with the highest priority (deletemin())
- Important for greedy algorithms
SIMPLE IMPLEMENTATIONS
Linked list (sketch below)
- insert at front: O(1)
- deletemin: O(N)
Sorted linked list
- insert: O(N)
- deletemin: O(1)
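A minimal C sketch of the unsorted linked-list scheme above; the function
names and int keys are illustrative assumptions, not from the lecture.
insert is O(1) because it pushes at the front; deletemin is O(N) because it
must scan every node.

    #include <stdlib.h>

    struct Node { int key; struct Node *next; };

    /* O(1): push the new key at the front of the list. */
    void insert(struct Node **head, int key) {
        struct Node *n = malloc(sizeof *n);
        n->key = key;
        n->next = *head;
        *head = n;
    }

    /* O(N): scan the whole list for the smallest key, unlink it, return it.
     * Caller must ensure the list is non-empty. */
    int deletemin(struct Node **head) {
        struct Node **min = head;
        for (struct Node **p = &(*head)->next; *p != NULL; p = &(*p)->next)
            if ((*p)->key < (*min)->key)
                min = p;
        struct Node *victim = *min;
        int key = victim->key;
        *min = victim->next;
        free(victim);
        return key;
    }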
BST
- O(log N) for both insert and deletemin
- Notice that we only delete the minimum node. Although repeated deletemins do
affect the balance, operations remain O(log N)
- BST is a bit of overkill: no need for certain operations
Priority Queue Presented
- O(log N) worst-case insert and deletemin
- O(1) average insertion time
- No links required
Binary Heap
- A complete binary tree: every level fully filled, except possibly the last,
which fills left to right
- So regular in structure, we can represent it as an array!
- Element i: left child 2i -- right child 2i+1 -- parent floor(i/2)
- No links equals speed
- Need size in advance -- not usually a problem
- Heap-order property: the root is the minimum. For each non-root node x,
the value of the parent is less than or equal to the value of x.
- Heap-order property guarantees fast deletemin (sketch below)
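A minimal C sketch of the array representation just described, with index 0
unused so the 2i / 2i+1 / i/2 arithmetic works out. The fixed CAPACITY and
function names are assumptions for illustration; error handling for a full or
empty heap is omitted.

    #define CAPACITY 1023

    int heap[CAPACITY + 1];   /* heap[1..size]; index 0 is unused */
    int size = 0;

    /* Insert: open a hole at the next free slot, then percolate it up
     * until the heap-order property holds.  O(log N) worst case, O(1)
     * on average. */
    void insert(int key) {
        int i = ++size;                      /* assumes size <= CAPACITY */
        for (; i > 1 && heap[i / 2] > key; i /= 2)
            heap[i] = heap[i / 2];           /* slide the parent down */
        heap[i] = key;
    }

    /* Deletemin: remove the root (the minimum, by heap order), then
     * percolate the hole down and refill it with the last element. */
    int deletemin(void) {
        int min = heap[1];                   /* assumes size >= 1 */
        int last = heap[size--];
        int i = 1, child;
        while ((child = 2 * i) <= size) {
            if (child < size && heap[child + 1] < heap[child])
                child++;                     /* choose the smaller child */
            if (heap[child] < last)
                heap[i] = heap[child];       /* slide the child up */
            else
                break;
            i = child;
        }
        heap[i] = last;
        return min;
    }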
COLLISIONS
Performance of collision schemes is a function of the load factor:
X = load = (# elements N) / (table size M)
Large X ==> more expensive to do finds, etc.
Separate Chaining
Unsuccessful search
- N/M elements per table entry on average
- Must search through all of them
- Running time approximately X
Successful search
- Must do at least 1 comparison for the match
- (N-1)/M other nodes to look at
- On average, we look at half of those nodes
- Running time: 1 + ((N-1)/M)/2, which is approximately 1 + X/2
Good idea to have table size equal the number of elements: X = 1 (find sketch below)
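A minimal C sketch of a separate-chaining find, using key % M as a stand-in
hash function (an assumption; the lecture did not fix one). The loop walks a
single chain, whose expected length is X = N/M, matching the costs above.

    struct Node { int key; struct Node *next; };

    /* Returns the matching node, or NULL on an unsuccessful search.
     * Expected cost: about X probes unsuccessful, about 1 + X/2 successful. */
    struct Node *find(struct Node *table[], int M, int key) {
        struct Node *p = table[key % M];   /* hash to a bucket */
        while (p != NULL && p->key != key)
            p = p->next;                   /* walk the chain */
        return p;
    }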
Open Addressing: X <= 1 by definition (at most one element per cell)
Random Probing
- Fraction of empty cells = 1 - fraction of full cells = 1 - X = chance of
probing an empty cell
- Expected number of probes for an unsuccessful search: 1/(1-X)
- Insertion: caveat is that X grows from 0 to its current value, so earlier
insertions are cheap and bring the average down
- Estimate of average insertion time: averaging 1/(1-x) as x runs from 0 to X
gives (1/X) ln(1/(1-X)); this should match the formula listed on page 163 of
the text. We see that this is better than linear probing.
Linear Probing
Average cost of operations depends on how the data is clustered.
For example, if the table is half full (N elements in a table of size 2N):
- Best case: every other slot full
- Any unsuccessful search is 1 + (0+1+0+1+...)/(2N) = 1 + N/(2N) = 1.5 probes
- Worst case: first half full, second half empty
- Any unsuccessful search is 1 + (N+(N-1)+(N-2)+...)/(2N), which is
approximately 1 + N/4 probes
Finding the average number of probes over different cluster lengths:
- Unsuccessful searches and insertions: (1/2)(1 + 1/(1-X)^2)
- Successful searches: (1/2)(1 + 1/(1-X))
Landmark results of Knuth's '62 report (checked numerically below)
- Want X <= 0.5
- X = 0.5 ==> 2.5 expected probes per unsuccessful search
- X = 0.9 ==> about 50 expected probes per unsuccessful search
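A quick numerical check of the formulas above (a sketch, not from the
lecture): it plugs X = 0.5 and X = 0.9 into the random-probing and
linear-probing expressions.

    #include <stdio.h>

    int main(void) {
        double loads[] = { 0.5, 0.9 };
        for (int i = 0; i < 2; i++) {
            double X = loads[i];
            double rand_unsucc = 1.0 / (1.0 - X);   /* random probing */
            double lin_unsucc = 0.5 * (1.0 + 1.0 / ((1.0 - X) * (1.0 - X)));
            double lin_succ   = 0.5 * (1.0 + 1.0 / (1.0 - X));
            printf("X=%.1f: random unsucc=%.2f  linear unsucc=%.2f  "
                   "linear succ=%.2f\n", X, rand_unsucc, lin_unsucc, lin_succ);
        }
        return 0;   /* linear unsuccessful: 2.50 at X=0.5, 50.50 at X=0.9 */
    }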
Quadratic Probing (sketch below)
- Eliminates the primary clustering problem of linear probing
- But bad if X > 0.5
- If the table is more than half full, we are not guaranteed to find an empty
cell
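A minimal sketch of a quadratic probe sequence, assuming non-negative keys
and a prime table size M (function name illustrative): the i-th probe lands
at h(key) + i^2 mod M, and with M prime the first M/2 probe locations are
distinct, which is where the X <= 0.5 requirement comes from.

    /* Location of the i-th probe for `key` in a table of (prime) size M. */
    int quadratic_probe(int key, int i, int M) {
        return (key % M + i * i) % M;
    }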
Double Hashing: use a second hash function to determine the probe step
(sketch below).
Linear probing is simple, but degrades quickly. Often, random
probing is better.
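A minimal sketch of a double-hashing probe sequence. The second hash
h2(k) = R - (k mod R), with R a prime smaller than M, is a common textbook
choice, not something the lecture specified; it is never 0, so every key
keeps stepping.

    /* Location of the i-th probe: h1(key) + i * h2(key), mod M. */
    int double_hash_probe(int key, int i, int M) {
        int R = 7;                     /* a prime < M; illustrative value */
        int h1 = key % M;
        int h2 = R - (key % R);        /* in 1..R, so never 0 */
        return (h1 + i * h2) % M;
    }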
BIRTHDAY PARADOX
Intuition: as X gets large, collisions increase.
X doesn't have to be very large to get collisions, however!
Von Mises birthday paradox: with 23 or more people in a room, there is a >50%
chance that two people share a birthday
Q(N) = probability that when we randomly toss N people into the table, there
are no collisions
P(N) = probability of at least 1 collision
Q(N) + P(N) = 1, so P(N) = 1 - Q(N)
Q(1) = 1
Q(2) = 364/365
Q(3) = 1 * (364/365) * (363/365)
Q(N) = Q(N-1) * (365-N+1)/365
     = (365 * 364 * ... * (365-N+1)) / 365^N
     = 365! / (365^N * (365-N)!)
P(N) = 1 - 365! / (365^N * (365-N)!)
This is counterintuitive but true! If a 365-slot hash table is only 10% full,
the collision probability is already over 50% (computed below)!
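A small C sketch that evaluates Q(N) by the recurrence above and reports the
first N where P(N) = 1 - Q(N) crosses 50%; it prints N = 23.

    #include <stdio.h>

    int main(void) {
        double Q = 1.0;                       /* Q(1) = 1 */
        for (int N = 2; N <= 365; N++) {
            Q *= (365.0 - N + 1) / 365.0;     /* Q(N) = Q(N-1)*(365-N+1)/365 */
            if (1.0 - Q > 0.5) {
                printf("P(%d) = %.3f > 0.5\n", N, 1.0 - Q);
                break;
            }
        }
        return 0;
    }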
EXTENSIBLE HASHING
Gives us a way to deal with data too large to fit into memory
Minimizes disk accesses
Recall the B-tree:
- Good to make it as short as possible
- Each node represents a disk access
- Recall that its size is determined by the number of keys in a node
We can make the root a hash table instead
- element => hash(element) => which disk block
Two issues to keep in mind: partial keys and chain limits
PARTIAL KEYS
Data | Hash
  A  | 1010
  B  | 0010
  C  | 1001
  D  | 0101
  E  | 1010
  F  | 0110
Index the directory by the leading bits of the hash: take 2 bits, take 1 bit,
or take 0 bits.
Trade-off: size of directory vs. list size (sketch below)
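A minimal sketch of indexing by partial keys, assuming the 4-bit hashes from
the table above: shifting keeps only the leading d bits, so d = 2 maps
A (hash 1010) to directory entry 10 (binary) = 2.

    /* Directory index = the first d bits of a 4-bit hash value. */
    int dir_index(unsigned hash4, int d) {
        return hash4 >> (4 - d);   /* d = 0 puts everything in one list */
    }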
CHAIN LIMITS
Given a chain limit, if our lists get too long, we can split the lists
up by applying another hash function, i.e., by looking at more bits of the
hash. We extend the bits, hence this technique is called extensible hashing
(worked example below).
Extensible Hash Table of order d
- Directory of 2^d references (to disk blocks)
- Pages (lists) have up to L items
- Items on a page are identical in their first k bits
- The directory contains 2^(d-k) pointers to such a page
- Doesn't work if more than L duplicates (identical keys agree in every bit,
so no split can separate them)
- Directory size (2^d) is roughly (number of keys)/L
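A worked example, assuming a chain limit of L = 2 and the hashes from
PARTIAL KEYS: with d = 2, directory entry 10 receives A (1010), C (1001), and
E (1010), which is three items and over the limit. Splitting on a third bit
separates C (100...) from A and E (101...), and the directory grows to d = 3.
Note that A and E agree in all four hash bits, so with L = 1 no amount of
splitting could separate them; that is the duplicate problem noted above.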
Nhat Minh Dau, nmd13@columbia.edu