猴年马月: NetApp的Phone Interview惨败（五）

9. 一个很大的整型数组（length possibly reaches millions），里面有duplicate，用怎样的方法可以快速地去除这些duplicate？

我感觉应该先把这个数组sort一下，但是之后该怎么办呢？完全没有思路。当时只提到了用quicksort排序先，没想到面试官接着又问quicksort的工作原理。

亏了我前晚在看1337coder的blog关于recursion的文章的时候顺带查了一下The C Programming Language, 2nd Edition, by K&R，大概是第87页上谈到了quicksort()的实现。The Practice of Programming by K&R的第32~34页上也谈到了其实现，还给出了算法的量级。

我就和他大致讲了一下quicksort的实现思路。不料他又继续追问这个算法的量级，这个TCPL没有提到，TPOP提到了，但我没有仔细看。结果随口报了个O(logN)的量级，他没做声。后来我翻书一查，才发现是O(NLogN)的量级。操，又挨了一刀。

OK，转入正题。究竟该怎样做才可以remove duplicates呢？下面是我在mitbbs上的求助和高手的应助：
－－－
一个很大的整型数组，里面有duplicate，用怎样的方法可以快速地去除这些duplicate？我感觉应该先把这个数组sort一下，但是之后该怎么办呢？完全没有思路。

by johny

bitmap比较好，当然sort也可以做。这个有个问题就是，可能in-place remove 所用的两个指针所指的位置不能同时装入内存，这个问题还没想好如何解决。

by Chevy

bitmap不是图形格式吗？能把数组变成一个bitmap？不太懂啊。能不能给段代码的例子？

by johny

bitmap你可以看看programming pearls的第一章

by Chevy

还有其他的方法吗？

by johny

我知道的就这些了。如果本来已经排序好了的话，可以用binary search，在O(logN)时间内找到一个。

by Chevy

binary search倒是可以用，但找到以后，还要把duplicate从这个大数组里面剔除，我所能想到的只有remove之后，再依次把后面的element向前挪一位依次填坑，但似乎这样太影performance。

当时面试官给的提示是再新建一个数组，似乎他的意思是把没有duplicate的element往新数组里面扔。而且他还提到的hashtable，这个地方该怎样应用hashtable呢？

by johny

你每次遇到一个数，先判断已经在不在hashtable中，在的话说明重复；不在的话，插入hashtable。

by Chevy

Remove duplicates 的 Sort 的方法如下：

/* Remove the duplicates in place, and returns
   the new size of the array. Assumes the array
   is already sorted. */
int removeDuplicates(int A[], int n) {
    int j = 0;

    for (int i = 0; i < n; i++) {
        if (A[i] != A[j])
            A[++j] = A[i];
    }

    return j+1;
}

by ihasleetcode

已经验证过了，确实可以实现in place remove duplicates. 顺便问一下，这段code在你的blog里面出现过吗？这是个有名的算法吗？

当时面试官给的提示是再新建一个数组，似乎他的意思是把没有duplicate的element往新数组里面扔。他还提到要用hashtable的方法，这个地方该怎样应用hashtable呢？据他说用hashtable的话，就用不着先对数组进行排序了。

我对hashtable没什么概念，只知道是个key-value pair的table。似乎还要定义一个hash function来建立从key到value之间的mapping关系。

如果用面试官所说的方法，该怎么解呢？可否给个example code？

by johny
－－－
还没有完全解开，但已经有了大致的思路。其一是参照Programming Pearl第一章去用所谓的“bitmap”；其二是“每次遇到一个数，先判断已经在不在hashtable中，在的话说明重复；不在的话，插入hashtable”，但关于hash的知识还得补；其三为in-place removal as 1337 exampled（当然要先quicksort一把）。

后续的帖子：
－－－
不要sorting的remove duplicate：

const int REMOVE = -1000;

int removeduplicate(int a[], int n) {
    unordered_map hash;

    for(int i = 0; i < n; i++) {
        if(hash.find(a[i]) != hash.end()) {
            a[i] = REMOVE;
        } else {
            hash[a[i]] = 1;
        }
    }

    int j=0;
    int i=0;

    while (i < n) {
        while (a[i] == REMOVE) {
            i++;
        }
        a[j] = a[i];
        i++;
        j++;
    }

    return j;
}

by icn9

谢谢啊。这两天我也在学习hashtable的方法，刚好看到C++里面有个map可以用。调试了一下你的程序，稍作调整后的code如下：

#include <iostream>
//#include <unordered_map>
#include <map>

using namespace std;

const int REMOVE = -1000;

int remove_duplicate(int a[], int n) {
    int i = 0, j = 0;
    //unordered_map hash;
    map hash;

    for (i = 0; i < n; ++i) {
        if (hash.find(a[i]) != hash.end()) {
            a[i] = REMOVE;
        } else {
            hash[a[i]] = 1;
        }
    }

    j = 0;
    i = 0;

    while (i < n) {
        while (a[i] == REMOVE) {
            i++;
        }
        a[j] = a[i];
        i++;
        j++;
    }

    return j;
}

int main()
{
    int arr[8] = {3, 2, 7, 4, 4, 6, 1, 5};
    int len = 0, i = 0;

    len = remove_duplicate(arr, sizeof(arr));
    cout << "The new size of arr[] is " << len << endl;

    for (i = 0; i < len; ++i)
        cout << arr[i] << "\t";
    cout << endl;

    getchar();
    return 0;
}

我在windows xp下的用的是Dev-C++，似乎比较老，因此而不支持，因此我把它改成map了。但编译通过后，跑出来的结果很奇怪，见附图。能帮我看一下是什么导致的吗？

by johny

You passed a wrong size of the array and accessed memory out of boundary:

    int arr[8] = {3, 2, 7, 4, 4, 6, 1, 5};
    ...
    len = remove_duplicate(arr, sizeof(arr));

When sizeof is applied to an array, the result is the size in bytes of the array in memory. Thus, it should be changed to:

    len = remove_duplicate(arr, sizeof(arr)/sizeof(int));

by recursive

多谢，确实是这个问题。我的疏忽。改成remove_duplicate(arr, sizeof(arr)/sizeof(arr[0]))了，编译通过，运行正常。多谢指出这个错误。

不过，Dev-C++似乎不支持unordered_map。这个模板类不在C++的标准库里面吗？但是我用了1337coder的coding panel，也是有通不过这个模板类的编译。

为什么呢？

by johny

你的问题不在于map和unordered_map的区别。
不过如果你要用到unordered_map，可以这样写：

    #include <tr1/unordered_map>
    ...
    tr1::unordered_map hash;

同时这两个东西是有区别的，你应该要能理解他们的不同。

by recursive

: 你的问题不在于map和unordered_map的区别。

那倒是。用unordered_map的想法是不想用map对数组重新排序。节省一点CPU开销，出于performance的考虑。不过对这个测试程序来说，也许用不着。

: 不过如果你要用到unordered_map，可以这样写：
:     #include <tr1/unordered_map>
:     ...
:     tr1::unordered_map hash;
: 同时这两个东西是有区别的，你应该要能理解他们的不同。

谢谢提醒啊。加上你提示的改变之后，1337的coding panel可以通过，并顺利跑起来了。但是Dev-C++的编译还是报错。看来应该是编译器有点老了（2005年最后一个release）。

你的意思是unordered_map与tr1::unordered_map有区别吗？

by johny
－－－
这个问题到这里基本告一段落，特别感谢Chevy、1337、icn9与recursive同学。

猴年马月

猴年马月

收藏夹

Blog Archive

Friday, March 4, 2011

NetApp的Phone Interview惨败（五）

No comments:

Post a Comment