DATA STRUCTURES AND ALGORITHMS CS 3139
Lecture: February 25th, 1999
ANNOUNCEMENTS
- HW #1 was returned. Can graded the non-programming portion, and Nhat graded
the programming portions.
- HW #1 solutions available -- we incorporated quality student solutions into
the official solutions.
REVIEW
Table ADT
- Consists of key/value pairs
- Many ways to organize
- One way is to use a fixed-size array ==> hash table
Hash Table Issues
- Trade memory for running time
- f(key) ==> array index ==> lookup
- Hash table size
- Hash function
- Collisions: separate chaining -- open addressing -- rehashing O(N)
TODAY
Priority queues
Performance of collision schemes
Extensible hashing
PRIORITY QUEUES
Recall queues: first in, first out scheme
Suppose some items are more important than others, e.g., as given by a priority() function
Priority queues
- insert() elements, and remove the element with the highest priority (deletemin())
- Important for greedy algorithms
SIMPLE IMPLEMENTATIONS
Linked list (sketch below)
- insert at front: O(1)
- deletemin: O(N)
Sorted linked list
- insert: O(N)
- deletemin: O(1)
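A minimal C sketch of the unsorted linked-list scheme above; the function
names and int keys are illustrative assumptions, not from the lecture.
insert is O(1) because it pushes at the front; deletemin is O(N) because it
must scan every node.

    #include <stdlib.h>

    struct Node { int key; struct Node *next; };

    /* O(1): push the new key at the front of the list. */
    void insert(struct Node **head, int key) {
        struct Node *n = malloc(sizeof *n);
        n->key = key;
        n->next = *head;
        *head = n;
    }

    /* O(N): scan the whole list for the smallest key, unlink it, return it.
     * Caller must ensure the list is non-empty. */
    int deletemin(struct Node **head) {
        struct Node **min = head;
        for (struct Node **p = &(*head)->next; *p != NULL; p = &(*p)->next)
            if ((*p)->key < (*min)->key)
                min = p;
        struct Node *victim = *min;
        int key = victim->key;
        *min = victim->next;
        free(victim);
        return key;
    }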
BST
- O(log N) for both insert and deletemin
- Notice that we only delete the minimum node. Although repeated deletemins do
affect the balance, operations remain O(log N)
- BST is a bit of overkill: no need for certain operations
Priority Queue Presented
- O(log N) worst-case insert and deletemin
- O(1) average insertion time
- No links required
Binary Heap
- A complete binary tree: every level fully filled, except possibly the last,
which fills left to right
- So regular in structure, we can represent it as an array!
- Element i: left child 2i -- right child 2i+1 -- parent floor(i/2)
- No links equals speed
- Need size in advance -- not usually a problem
- Heap-order property: the root is the minimum. For each non-root node x,
the value of the parent is less than or equal to the value of x.
- Heap-order property guarantees fast deletemin (sketch below)
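A minimal C sketch of the array representation just described, with index 0
unused so the 2i / 2i+1 / i/2 arithmetic works out. The fixed CAPACITY and
function names are assumptions for illustration; error handling for a full or
empty heap is omitted.

    #define CAPACITY 1023

    int heap[CAPACITY + 1];   /* heap[1..size]; index 0 is unused */
    int size = 0;

    /* Insert: open a hole at the next free slot, then percolate it up
     * until the heap-order property holds.  O(log N) worst case, O(1)
     * on average. */
    void insert(int key) {
        int i = ++size;                      /* assumes size <= CAPACITY */
        for (; i > 1 && heap[i / 2] > key; i /= 2)
            heap[i] = heap[i / 2];           /* slide the parent down */
        heap[i] = key;
    }

    /* Deletemin: remove the root (the minimum, by heap order), then
     * percolate the hole down and refill it with the last element. */
    int deletemin(void) {
        int min = heap[1];                   /* assumes size >= 1 */
        int last = heap[size--];
        int i = 1, child;
        while ((child = 2 * i) <= size) {
            if (child < size && heap[child + 1] < heap[child])
                child++;                     /* choose the smaller child */
            if (heap[child] < last)
                heap[i] = heap[child];       /* slide the child up */
            else
                break;
            i = child;
        }
        heap[i] = last;
        return min;
    }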
COLLISIONS
Performance of collision schemes is a function of the load factor:
X = load = (# elements N) / (table size M)
Large X ==> more expensive to do finds, etc.
Separate Chaining
Unsuccessful search
- N/M elements per table entry on average
- Must search through all of them
- Running time approximately X
Successful search
- Must do at least 1 comparison for the match
- (N-1)/M other nodes to look at
- On average, we look at half of those nodes
- Running time: 1 + ((N-1)/M)/2, which is approximately 1 + X/2
Good idea to have table size equal the number of elements: X = 1 (find sketch below)
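A minimal C sketch of a separate-chaining find, using key % M as a stand-in
hash function (an assumption; the lecture did not fix one). The loop walks a
single chain, whose expected length is X = N/M, matching the costs above.

    struct Node { int key; struct Node *next; };

    /* Returns the matching node, or NULL on an unsuccessful search.
     * Expected cost: about X probes unsuccessful, about 1 + X/2 successful. */
    struct Node *find(struct Node *table[], int M, int key) {
        struct Node *p = table[key % M];   /* hash to a bucket */
        while (p != NULL && p->key != key)
            p = p->next;                   /* walk the chain */
        return p;
    }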
Open Addressing: X <= 1 by definition (at most one element per cell)
Random Probing
- Fraction of empty cells = 1 - fraction of full cells = 1 - X = chance of
probing an empty cell
- Expected number of probes for an unsuccessful search: 1/(1-X)
- Insertion: caveat is that X grows from 0 to its current value, so earlier
insertions are cheap and bring the average down
- Estimate of average insertion time: averaging 1/(1-x) as x runs from 0 to X
gives (1/X) ln(1/(1-X)); this should match the formula listed on page 163 of
the text. We see that this is better than linear probing.
Linear Probing
Average cost of operations depends on how the data is clustered.
For example, if the table is half full (N elements in a table of size 2N):
- Best case: every other slot full
- Any unsuccessful search is 1 + (0+1+0+1+...)/(2N) = 1 + N/(2N) = 1.5 probes
- Worst case: first half full, second half empty
- Any unsuccessful search is 1 + (N+(N-1)+(N-2)+...)/(2N), which is
approximately 1 + N/4 probes
Finding the average number of probes over different cluster lengths:
- Unsuccessful searches and insertions: (1/2)(1 + 1/(1-X)^2)
- Successful searches: (1/2)(1 + 1/(1-X))
Landmark results of Knuth's '62 report (checked numerically below)
- Want X <= 0.5
- X = 0.5 ==> 2.5 expected probes per unsuccessful search
- X = 0.9 ==> about 50 expected probes per unsuccessful search
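A quick numerical check of the formulas above (a sketch, not from the
lecture): it plugs X = 0.5 and X = 0.9 into the random-probing and
linear-probing expressions.

    #include <stdio.h>

    int main(void) {
        double loads[] = { 0.5, 0.9 };
        for (int i = 0; i < 2; i++) {
            double X = loads[i];
            double rand_unsucc = 1.0 / (1.0 - X);   /* random probing */
            double lin_unsucc = 0.5 * (1.0 + 1.0 / ((1.0 - X) * (1.0 - X)));
            double lin_succ   = 0.5 * (1.0 + 1.0 / (1.0 - X));
            printf("X=%.1f: random unsucc=%.2f  linear unsucc=%.2f  "
                   "linear succ=%.2f\n", X, rand_unsucc, lin_unsucc, lin_succ);
        }
        return 0;   /* linear unsuccessful: 2.50 at X=0.5, 50.50 at X=0.9 */
    }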
Quadratic Probing (sketch below)
- Eliminates the primary clustering problem of linear probing
- But bad if X > 0.5
- If the table is more than half full, we are not guaranteed to find an empty
cell
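A minimal sketch of a quadratic probe sequence, assuming non-negative keys
and a prime table size M (function name illustrative): the i-th probe lands
at h(key) + i^2 mod M, and with M prime the first M/2 probe locations are
distinct, which is where the X <= 0.5 requirement comes from.

    /* Location of the i-th probe for `key` in a table of (prime) size M. */
    int quadratic_probe(int key, int i, int M) {
        return (key % M + i * i) % M;
    }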
Double Hashing: use a second hash function to determine the probe step
(sketch below).
Linear probing is simple, but degrades quickly. Often, random
probing is better.
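A minimal sketch of a double-hashing probe sequence. The second hash
h2(k) = R - (k mod R), with R a prime smaller than M, is a common textbook
choice, not something the lecture specified; it is never 0, so every key
keeps stepping.

    /* Location of the i-th probe: h1(key) + i * h2(key), mod M. */
    int double_hash_probe(int key, int i, int M) {
        int R = 7;                     /* a prime < M; illustrative value */
        int h1 = key % M;
        int h2 = R - (key % R);        /* in 1..R, so never 0 */
        return (h1 + i * h2) % M;
    }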
BIRTHDAY PARADOX
Intuition: as X gets large, collisions increase.
X doesn't have to be very large to get collisions, however!
Von Mises birthday paradox: with 23 or more people in a room, there is a >50%
chance that two people share a birthday
Q(N) = probability that when we randomly toss N people into the table, there
are no collisions
P(N) = probability of at least 1 collision
Q(N) + P(N) = 1, so P(N) = 1 - Q(N)
Q(1) = 1
Q(2) = 364/365
Q(3) = 1 * (364/365) * (363/365)
Q(N) = Q(N-1) * (365-N+1)/365
     = (365 * 364 * ... * (365-N+1)) / 365^N
     = 365! / (365^N * (365-N)!)
P(N) = 1 - 365! / (365^N * (365-N)!)
This is counterintuitive but true! If a 365-slot hash table is only 10% full,
the collision probability is already over 50% (computed below)!
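A small C sketch that evaluates Q(N) by the recurrence above and reports the
first N where P(N) = 1 - Q(N) crosses 50%; it prints N = 23.

    #include <stdio.h>

    int main(void) {
        double Q = 1.0;                       /* Q(1) = 1 */
        for (int N = 2; N <= 365; N++) {
            Q *= (365.0 - N + 1) / 365.0;     /* Q(N) = Q(N-1)*(365-N+1)/365 */
            if (1.0 - Q > 0.5) {
                printf("P(%d) = %.3f > 0.5\n", N, 1.0 - Q);
                break;
            }
        }
        return 0;
    }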
EXTENSIBLE HASHING
Gives us a way to deal with data too large to fit into memory
Minimizes disk accesses
Recall the B-tree:
- Good to make it as short as possible
- Each node represents a disk access
- Recall that its size is determined by the number of keys in a node
We can make the root a hash table instead
- element => hash(element) => which disk block
Two issues to keep in mind: partial keys and chain limits
PARTIAL KEYS
Data | Hash
  A  | 1010
  B  | 0010
  C  | 1001
  D  | 0101
  E  | 1010
  F  | 0110
Index the directory by the leading bits of the hash: take 2 bits, take 1 bit,
or take 0 bits.
Trade-off: size of directory vs. list size (sketch below)
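A minimal sketch of indexing by partial keys, assuming the 4-bit hashes from
the table above: shifting keeps only the leading d bits, so d = 2 maps
A (hash 1010) to directory entry 10 (binary) = 2.

    /* Directory index = the first d bits of a 4-bit hash value. */
    int dir_index(unsigned hash4, int d) {
        return hash4 >> (4 - d);   /* d = 0 puts everything in one list */
    }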
CHAIN LIMITS
Given a chain limit, if our lists get too long, we can split the lists
up by applying another hash function, i.e., by looking at more bits of the
hash. We extend the bits, hence this technique is called extensible hashing
(worked example below).
Extensible Hash Table of order d
- Directory of 2^d references (to disk blocks)
- Pages (lists) have up to L items
- Items on a page are identical in their first k bits
- The directory contains 2^(d-k) pointers to such a page
- Doesn't work if more than L duplicates (identical keys agree in every bit,
so no split can separate them)
- Directory size (2^d) is roughly (number of keys)/L
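A worked example, assuming a chain limit of L = 2 and the hashes from
PARTIAL KEYS: with d = 2, directory entry 10 receives A (1010), C (1001), and
E (1010), which is three items and over the limit. Splitting on a third bit
separates C (100...) from A and E (101...), and the directory grows to d = 3.
Note that A and E agree in all four hash bits, so with L = 1 no amount of
splitting could separate them; that is the duplicate problem noted above.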
Nhat Minh Dau, nmd13@columbia.edu