Wednesday, January 4, 2012

Median of Two Sorted Arrays

Source

There are two sorted arrays A and B of size m and n respectively. Find the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).

This problem turns out to be somewhat difficult and is non-trivial to implement correctly (at least for the general case). However, it has been asked in interviews by various big-name companies (Google, Microsoft, and Amazon), so it is worth understanding the idea well.
Online Judge
This problem is available at Online Judge. Head over there and it will judge your solution. It can currently compile only C++ code. If you are using another language, you can still verify your solution by looking at the judge's test cases and their expected output.
Solution:
If you search for this problem on Google, you will find tons of hits. However, most of them deal with the special case where m == n, and even so their code is filled with bugs. The CLRS book has this problem as Exercise 9.3-8, but it also assumes m == n. The only reliable solution I found on the web that deals with the general case also seemed incorrect, as its definition of the median is the single middle element (although its approach of using binary search is pretty neat). According to the definition of the median, if (m + n) is even, then the median should be the mean of the two middle numbers.
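To make that definition concrete, here is a minimal baseline sketch that walks both arrays in merged order and stops at the middle. It runs in O(m+n), so it does not meet the required bound; it is only meant as a reference to check faster solutions against. The name medianByMerge and the use of std::vector are assumptions for illustration, not part of the problem statement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: O(m+n) time, O(1) extra space.
// Both inputs must already be sorted in ascending order.
double medianByMerge(const std::vector<int>& a, const std::vector<int>& b) {
  const std::size_t total = a.size() + b.size();
  assert(total >= 1);
  std::size_t i = 0, j = 0;
  long long prev = 0, cur = 0;  // long long avoids overflow in prev + cur
  // After the loop, cur is merged[total/2] and prev is merged[total/2 - 1].
  for (std::size_t step = 0; step <= total / 2; ++step) {
    prev = cur;
    if (j >= b.size() || (i < a.size() && a[i] <= b[j]))
      cur = a[i++];
    else
      cur = b[j++];
  }
  // Odd total: single middle element. Even total: mean of the two middle ones.
  return (total % 2 == 1) ? static_cast<double>(cur) : (prev + cur) / 2.0;
}
```

For example, medianByMerge({1, 3}, {2}) gives 2, while medianByMerge({1, 2}, {3, 4}) gives 2.5.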
If you read my previous post, Find the k-th Smallest Element in the Union of Two Sorted Arrays, you know that this problem is somewhat similar. In fact, finding the median of two sorted arrays when (m + n) is odd can be thought of as solving the special case where k=(m+n)/2 (integer division, counting k from 0; it is k=(m+n)/2+1 if you count from 1). Although we can still apply the k-th smallest algorithm twice to find the two middle numbers when (m + n) is even, that is no longer a desirable solution due to its inefficiency.
You might ask: Why not adapt the previous solution to this problem? After all, the previous algorithm solves a more general case. Well, I've tried that, and I don't consider the previous solution easily adaptable to this problem. The main reason is that when (m + n) is even, the two middle elements might be located in the same array. This complicates the algorithm, and many special cases have to be dealt with on a case-by-case basis.
Similar to finding the k-th smallest, the divide and conquer method is a natural approach to this problem. First, we choose Ai and Bj (the middle elements of A and B), where i and j are defined as m/2 and n/2. We observe that if Ai <= Bj, then the median must be somewhere between Ai and Bj (inclusive). Therefore, we could discard a total of i elements to the left of Ai and a total of n-j-1 elements to the right of Bj. Take extra care not to discard Ai or Bj, as we might need two middle values to calculate the median (it is also possible that the two middle values are both in the same array). The case where Ai > Bj is similar.
Figure: Two sorted arrays A and B. i is chosen as m/2 and j is chosen as n/2. Ai and Bj are the middle elements of A and B. If Ai <= Bj, then the median must be between Ai and Bj (inclusive). The opposite case is symmetric.
The main idea illustrated above is mostly right; however, there is one more important invariant we have to maintain. It is entirely possible that the number of elements discarded from each array is different. Look at the example above: if Ai <= Bj, two elements to the left of Ai and three elements to the right of Bj are discarded. Notice that this is no longer a valid sub-problem, as the median of the two remaining sub-arrays is no longer the original median.
Therefore, an important invariant we have to maintain is:
The number of elements discarded from each array must be the same.
This can be achieved easily by choosing the number of elements to discard from each array as follows (Warning: the condition below fails to handle an edge case; for more details, see the EDIT section below):
k = min(i, n-j-1) when Ai <= Bj.                   <--- 1(a)
k = min(m-i-1, j) when Ai > Bj.                    <--- 1(b)
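To see why the invariant matters, here is a small sketch working on an already merged sequence: removing the same number of elements from the low end and the high end leaves the median unchanged, while removing unequal amounts shifts it. The helper names medianOfSorted and trim are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sorted sequence.
double medianOfSorted(const std::vector<int>& v) {
  assert(!v.empty());
  const std::size_t n = v.size();
  if (n % 2 == 1) return v[n / 2];
  return (static_cast<long long>(v[n / 2 - 1]) + v[n / 2]) / 2.0;
}

// Copy of v without its first `left` and last `right` elements.
std::vector<int> trim(const std::vector<int>& v, std::size_t left, std::size_t right) {
  assert(left + right < v.size());
  return std::vector<int>(v.begin() + static_cast<std::ptrdiff_t>(left),
                          v.end() - static_cast<std::ptrdiff_t>(right));
}
```

With the merged sequence 1, 2, ..., 10 the median is 5.5. trim(merged, 2, 2) leaves 3..8, whose median is still 5.5; trim(merged, 2, 3) leaves 3..7, whose median is 5.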
Figuring out how to subdivide the problem is actually the easy part. The hard part is figuring out the base case (i.e., when should we stop subdividing?).
It is obvious that when m=1 or n=1, you must treat it as a special base case, or else it would end up in an infinite loop. The hard part is reasoning why m=2 or n=2 requires special case handling as well. (Hint: The two middle elements might be in the same array.)
Finally, implementing the above idea turns out to be an extremely tricky coding exercise. Before looking at the solution below, try to challenge yourself by coding the algorithm.
If you have more elegant code for this problem, I would love to hear from you!
EDIT:
Thanks to Algorist for being the first person to point out a bug. (For more details, read his comment.) The bug is caused by some edge cases that are not handled in the base case.
Shortly after I fixed that bug, I discovered another edge case myself which my previous code failed to handle.
An example of one of the edge cases is:
A = { 1, 2, 4, 8, 9, 10 }
B = { 3, 5, 6, 7 }
The above conditions ( 1(a), 1(b) ) fail to handle this edge case: the code returns 5 as the median, while the correct answer is 5.5.
The reason is that the number 5 is discarded in the first iteration, while it should be considered in the final evaluation step of the median. To resolve this edge case, we have to be careful not to discard the element just before the middle one when the array's size is even. Here are the corrected conditions ( 2(a), 2(b), 2(c), 2(d) ) for k, which resolve this edge case.
k = min(i-1, n-j-1) when Ai <= Bj and m is even.   <--- 2(a)
k = min(i, n-j-1)   when Ai <= Bj and m is odd.    <--- 2(b)
k = min(m-i-1, j-1) when Ai > Bj  and n is even.   <--- 2(c)
k = min(m-i-1, j)   when Ai > Bj  and n is odd.    <--- 2(d)
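As a quick check of the two rule sets on the edge case above, here is a minimal sketch that computes k under conditions 1(a)/1(b) and under 2(a) to 2(d). The function names and the boolean parameter aMidLeqBMid (true when Ai <= Bj) are hypothetical; only the formulas come from the conditions listed above.

```cpp
#include <algorithm>
#include <cassert>

// k under the original conditions 1(a) and 1(b).
int discardCountOriginal(int m, int n, bool aMidLeqBMid) {
  assert(m >= 3 && n >= 3);  // smaller sizes are base cases
  const int i = m / 2, j = n / 2;
  return aMidLeqBMid ? std::min(i, n - j - 1) : std::min(m - i - 1, j);
}

// k under the corrected conditions 2(a) to 2(d).
int discardCountCorrected(int m, int n, bool aMidLeqBMid) {
  assert(m >= 3 && n >= 3);
  const int i = m / 2, j = n / 2;
  if (aMidLeqBMid)
    return (m % 2 == 0) ? std::min(i - 1, n - j - 1)   // 2(a)
                        : std::min(i, n - j - 1);      // 2(b)
  return (n % 2 == 0) ? std::min(m - i - 1, j - 1)     // 2(c)
                      : std::min(m - i - 1, j);        // 2(d)
}
```

For A = { 1, 2, 4, 8, 9, 10 } and B = { 3, 5, 6, 7 } we have m = 6, n = 4, Ai = 8 and Bj = 6, so Ai > Bj. The original rule gives k = 2, which drops both 3 and 5 from B; the corrected rule gives k = 1, which drops only 3 and keeps 5 available for the final step.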
Below is the bug-free code, the result of lengthy, rigorous testing of all possible edge cases. (Not for the faint of heart!)
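#include <algorithm>  // std::min, std::max
#include <cassert>    // assert
#include <climits>    // INT_MIN, INT_MAX
using std::max;
using std::min;

double findMedianBaseCase(int med, int C[], int n) {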
  if (n == 1)
    return (med+C[0])/2.0;
 
  if (n % 2 == 0) {
    int a = C[n/2 - 1], b = C[n/2];
    if (med <= a)
      return a;
    else if (med <= b)
      return med;
    else /* med > b */
      return b;
  } else {
    int a = C[n/2 - 1], b = C[n/2], c = C[n/2 + 1];
    if (med <= a)
      return (a+b) / 2.0;
    else if (med <= c)
      return (med+b) / 2.0;
    else /* med > c */
      return (b+c) / 2.0;
  }
}
 
double findMedianBaseCase2(int med1, int med2, int C[], int n) {
  if (n % 2 == 0) {
    int a = (((n/2-2) >= 0) ? C[n/2 - 2] : INT_MIN);
    int b = C[n/2 - 1], c = C[n/2];
    int d = (((n/2 + 1) <= n-1) ? C[n/2 + 1] : INT_MAX);
    if (med2 <= b)
      return (b+max(med2,a)) / 2.0;
    else if (med1 <= b)
      return (b+min(med2,c)) / 2.0;
    else if (med1 >= c)
      return (c+min(med1,d)) / 2.0;
    else if (med2 >= c)
      return (c+max(med1,b)) / 2.0;
    else  /* b < med1 <= med2 < c */
      return (med1+med2) / 2.0;
  } else {
    int a = C[n/2 - 1], b = C[n/2], c = C[n/2 + 1];
    if (med1 >= b)
      return min(med1, c);
    else if (med2 <= b)
      return max(med2, a);
    else  /* med1 < b < med2 */
      return b;
  }
}
 
double findMedianSingleArray(int A[], int n) {
  assert(n > 0);
  return ((n%2 == 1) ? A[n/2] : (A[n/2-1]+A[n/2])/2.0);
}
 
double findMedianSortedArrays(int A[], int m, int B[], int n) {
  assert(m+n >= 1);
  if (m == 0)
    return findMedianSingleArray(B, n);
  else if (n == 0)
    return findMedianSingleArray(A, m);
  else if (m == 1)
    return findMedianBaseCase(A[0], B, n);
  else if (n == 1)
    return findMedianBaseCase(B[0], A, m);
  else if (m == 2)
    return findMedianBaseCase2(A[0], A[1], B, n);
  else if (n == 2)
    return findMedianBaseCase2(B[0], B[1], A, m);
 
  int i = m/2, j = n/2, k;
  if (A[i] <= B[j]) {
    k = ((m%2 == 0) ? min(i-1, n-j-1) : min(i, n-j-1));
    assert(k > 0);
    return findMedianSortedArrays(A+k, m-k, B, n-k);
  } else {
    k = ((n%2 == 0) ? min(m-i-1, j-1) : min(m-i-1, j));
    assert(k > 0);
    return findMedianSortedArrays(A, m-k, B+k, n-k);
  }
}
EDIT2:
A reader, buried.shopno, managed to code the solution more elegantly! I especially like how medianOfThree and medianOfFour were implemented. For more details, read his comment below. Great job!
Further thoughts:
A reader, nimin98, suggested that the base case can be handled by simply doing a direct merge. In other words, we merge the short array (containing either one or two elements) with the longer array (picking the four elements near the middle; deciding which four is another tricky business because of multiple special cases). nimin98's code has a few bugs in its handling of the base case.
In general, the above approaches (including mine) to handling the base case are not recommended due to their tricky implementation. How about binary search? We can use binary search to find the correct position to insert elements from the shorter array into the longer array, thus completing the merge (you don't have to *actually* insert them; recording the index should suffice).
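One way this idea could look is sketched below. It is only an illustration under my own assumptions, not tested against the Online Judge: the name medianByInsertionIndex and the std::vector interface are hypothetical. Each element of the short array gets its index in the virtual merged array via std::lower_bound, and the middle element(s) are then read off without building the merge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of two sorted arrays where shortArr is small
// (for the base case, one or two elements). Cost: O(s log n) for s short elements.
double medianByInsertionIndex(const std::vector<int>& longArr,
                              const std::vector<int>& shortArr) {
  const std::size_t total = longArr.size() + shortArr.size();
  assert(total >= 1);

  // pos[t] = index shortArr[t] would occupy in the merged array:
  // (number of longArr elements strictly smaller) + (short elements before it).
  std::vector<std::size_t> pos(shortArr.size());
  for (std::size_t t = 0; t < shortArr.size(); ++t) {
    auto it = std::lower_bound(longArr.begin(), longArr.end(), shortArr[t]);
    pos[t] = static_cast<std::size_t>(it - longArr.begin()) + t;
  }

  // k-th smallest (0-based) element of the virtual merged array.
  auto kth = [&](std::size_t k) -> long long {
    std::size_t before = 0;  // short elements placed before index k
    for (std::size_t t = 0; t < pos.size(); ++t) {
      if (pos[t] == k) return shortArr[t];
      if (pos[t] < k) ++before;
    }
    return longArr[k - before];
  };

  if (total % 2 == 1) return static_cast<double>(kth(total / 2));
  return (kth(total / 2 - 1) + kth(total / 2)) / 2.0;
}
```

For instance, with the long array { 2, 4, 8, 9 } and the short array { 5, 6 }, the recorded indices are 2 and 3, which are exactly the two middle positions, so the result is 5.5.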