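Massive Data Processing

November 12, 2019

How to find the 10 most frequent queries among 1 billion search queries

http://dongxicheng.org/big-data/select-ten-from-billions/ (Dong's Blog)

Top-K problems: divide and conquer + trie/hash + min-heap

The standard approach

There are 2^32 = 4G possible IP address values.

The data cannot be loaded into memory all at once.

Divide and conquer: use Hash(IP) % 1024 to split the massive IP log into 1024 small files. For each file, build a hash_map with the IP as key and its occurrence count as value, and record the IP that occurs most often in that file. Because the same IP always hashes to the same file, the overall most frequent IP is the best of these 1024 per-file winners.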
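The counting and selection steps can be sketched in Java as follows. This is a minimal illustration, not the referenced blog's code: the class name TopKSketch and its three methods are hypothetical, and each "small file" is modeled as an in-memory list of keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKSketch {
    /** Partition index for a key, mirroring Hash(IP) % 1024 from the text. */
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Counts the occurrences of each key in one partition (one "small file"). */
    static Map<String, Integer> countPartition(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String k : keys) counts.merge(k, 1, Integer::sum);
        return counts;
    }

    /** Returns the k most frequent keys, most frequent first, using a min-heap of size k. */
    static List<String> topK(Map<String, Integer> counts, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();   // evict the current minimum
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey());
        return result;
    }
}
```

The heap never holds more than k + 1 entries, so selecting the top k from n distinct keys costs O(n log k) time instead of a full sort.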

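The BitSet data structure and an analysis of its JDK source code

I. BitSet basics

A bitset, also called a bitmap, shows up in many algorithm designs because it can represent a contiguous range of values in a very compact format. (The original post showed a figure here, taken from the C++ library's bitset.)

The basic idea is to use one bit to record whether a value has appeared: 0 means it has not, 1 means it has. To find out whether a number has appeared, check whether its bit is 0.

A 1 GB space holds 8*1024*1024*1024 = 8.59*10^9 bits, so it can represent about 8.59 billion distinct numbers.

A common application is statistics over massive data sets, such as log analysis.

It also comes up often in interview questions, for example: find the values missing from 4 billion values, or sort 4 billion distinct values.

Another example: given 10 million random numbers in the range 1 to 100 million, write an algorithm that finds the numbers between 1 and 100 million that are not among the random numbers (Baidu).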
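One possible way to answer the last question with java.util.BitSet is sketched below. The class and method names (MissingNumbers, missing) are made up for illustration; for a range of 100 million values the bitset needs roughly 12.5 MB.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class MissingNumbers {
    /** Returns every value in [1, max] that does not occur in values. */
    static List<Integer> missing(int[] values, int max) {
        BitSet seen = new BitSet(max + 1);
        for (int v : values) seen.set(v);            // mark each value as present
        List<Integer> result = new ArrayList<>();
        // nextClearBit walks the zero bits, i.e. the values that never appeared
        for (int i = seen.nextClearBit(1); i <= max; i = seen.nextClearBit(i + 1)) {
            result.add(i);
        }
        return result;
    }
}
```

For example, missing(new int[] {3, 1, 7, 3}, 8) yields 2, 4, 5, 6 and 8.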

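Programming Pearls also has a problem about using a bitset to look up phone numbers.

A common extension of the bitmap uses 2 or more bits per number to record more information about it, such as how many times it has appeared.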
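A 2-bit variant might look like the sketch below. The encoding (0 = never seen, 1 = seen once, 2 = seen more than once) and the class name TwoBitMap are assumptions chosen for illustration; they are not part of the JDK.

```java
class TwoBitMap {
    private final long[] words;   // each long holds 32 two-bit states

    TwoBitMap(int capacity) {
        words = new long[(capacity + 31) / 32];
    }

    /** Returns the state of v: 0 = never seen, 1 = seen once, 2 = seen more than once. */
    int get(int v) {
        int shift = (v & 31) * 2;
        return (int) ((words[v >>> 5] >>> shift) & 3L);
    }

    /** Records one more occurrence of v, saturating at state 2. */
    void add(int v) {
        int state = get(v);
        if (state < 2) {
            int shift = (v & 31) * 2;
            // clear the old two bits, then write the incremented state
            words[v >>> 5] = (words[v >>> 5] & ~(3L << shift)) | ((long) (state + 1) << shift);
        }
    }
}
```

Compared with a plain bitmap this doubles the memory, but it can tell "appeared exactly once" apart from "appeared repeatedly".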

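II. The BitSet implementation in Java

Although the structure is simple, an implementation has some details to watch out for. The key is a handful of bit operations: how to flip, set, and query the state (0 or 1) of a given bit.

This article walks through the BitSet implementation in Java, in the hope that it offers some ideas to programmers who need to design their own bitmap structure.

The basic bitmap operations are:

Initialize a bitset with a given size.
Clear the bitset.
Flip a given bit.
Set a given bit.
Get the state of a given bit.
Get the total number of bits in the bitset.

1. Declaration

In Java, BitSet lives in the java.util package and has been part of the JDK since 1.0. It has kept evolving across JDK releases. This article follows the implementation in the JDK 7 source code.

The declaration is as follows: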

package java.util;

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

public class BitSet implements Cloneable, java.io.Serializable {
    private long[] words;
    ...
    ...

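We can also see that a long array is used as the internal storage. This means the capacity of a BitSet is always a whole number of longs, i.e. a multiple of 64 bits. The constructors below reflect this.

2. Initialization functions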

public BitSet() {
    initWords(BITS_PER_WORD);
    sizeIsSticky = false;
}

public BitSet(int nbits) {
    // nbits can't be negative; size 0 is OK
    if (nbits < 0)
        throw new NegativeArraySizeException("nbits < 0: " + nbits);
    initWords(nbits);
    sizeIsSticky = true;
}

private void initWords(int nbits) {
    words = new long[wordIndex(nbits-1) + 1];
}

private static int wordIndex(int bitIndex) {
    return bitIndex >> ADDRESS_BITS_PER_WORD;
}

private final static int ADDRESS_BITS_PER_WORD = 6;
private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;

There are two constructors: one takes an initial size and one does not. Without a size, the default is BITS_PER_WORD = 2^6 = 64 bits: initWords(64) allocates wordIndex(64-1) + 1 = 1 long. A Java long is 8 bytes, i.e. 8*8 = 64 bits, so a default BitSet is exactly one long. The initialization function allocates just the space required.
Note: if an initial size is given, it is rounded up to a multiple of 64 that is greater than or equal to it. For example, 64 bits take 1 long, while 65 bits take 2 longs, i.e. 128 bits. This follows from the long[] storage; it keeps memory aligned and avoids special cases, which simplifies the code.
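The rounding can be observed through the public size() method. The sketch below assumes an OpenJDK-style implementation like the one listed above; BitSetSizeDemo and capacityFor are illustrative names.

```java
import java.util.BitSet;

class BitSetSizeDemo {
    /** Capacity in bits that a BitSet created with the given initial size reports. */
    static int capacityFor(int nbits) {
        return new BitSet(nbits).size();
    }

    public static void main(String[] args) {
        System.out.println(new BitSet().size());  // 64: the default is one long
        System.out.println(capacityFor(64));      // 64
        System.out.println(capacityFor(65));      // 128: rounded up to two longs
        System.out.println(capacityFor(0));       // 0: wordIndex(-1) + 1 == 0 longs
    }
}
```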

3. Clearing a BitSet

a. Clear all bits, i.e. set them all to 0. This is done by zeroing the words one by one in a loop. In C, would memset be faster?

public void clear() {
    while (wordsInUse > 0)
        words[--wordsInUse] = 0;
}

b. Clear a single bit

public void clear(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
    int wordIndex = wordIndex(bitIndex);
    if (wordIndex >= wordsInUse)
        return;
    words[wordIndex] &= ~(1L << bitIndex);
    recalculateWordsInUse();
    checkInvariants();
}

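The first line validates the argument: if bitIndex is negative, an IndexOutOfBoundsException is thrown. What follows is the classic two-step routine of bitset operations: a. find the corresponding long; b. operate on the corresponding bit.
a. Find the corresponding long. The statement is int wordIndex = wordIndex(bitIndex);
b. Operate on the corresponding bit. Bit operations are usually implemented as a logical operation with a specific mask, so the mask is obtained first.
    To clear a bit, the mask has 0 at the target bit and 1 everywhere else, and it is ANDed with the corresponding long.
    ~(1L << bitIndex) builds the mask;
    words[wordIndex] &= mask; applies it.
Note: the argument check throws for a negative index, but for an index beyond the words in use it does nothing and returns. The reason is that bits beyond wordsInUse are already 0 by definition, so there is nothing to clear and no need to grow the array.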
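One detail the listing relies on silently: 1L << bitIndex is used without reducing bitIndex modulo 64. That works because Java uses only the low 6 bits of the shift count when shifting a long. The sketch below replays the two steps on a plain long[]; ClearMaskDemo and clearBit are illustrative names, not JDK code.

```java
class ClearMaskDemo {
    /** Clears bit bitIndex in a long[] using the same two steps as BitSet.clear(int). */
    static void clearBit(long[] words, int bitIndex) {
        int wordIndex = bitIndex >> 6;     // step a: which long
        long mask = ~(1L << bitIndex);     // step b: 0 at the target bit, 1 elsewhere
        words[wordIndex] &= mask;
    }

    public static void main(String[] args) {
        // Only the low 6 bits of the shift count are used for a long,
        // so 1L << 70 equals 1L << (70 % 64), i.e. 1L << 6.
        System.out.println((1L << 70) == (1L << 6));               // true

        long[] words = { -1L, -1L };                               // all 128 bits set
        clearBit(words, 70);                                       // bit 6 of words[1]
        System.out.println(Long.toBinaryString(words[1] & 0xFF));  // 10111111
    }
}
```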

c. Clear a range of bits

/**
 * Sets the bits from the specified {@code fromIndex} (inclusive) to the
 * specified {@code toIndex} (exclusive) to {@code false}.
 *
 * @param fromIndex index of the first bit to be cleared
 * @param toIndex index after the last bit to be cleared
 * @throws IndexOutOfBoundsException if {@code fromIndex} is negative,
 *         or {@code toIndex} is negative, or {@code fromIndex} is
 *         larger than {@code toIndex}
 * @since 1.4
 */
public void clear(int fromIndex, int toIndex) {
    checkRange(fromIndex, toIndex);
    if (fromIndex == toIndex)
        return;
    int startWordIndex = wordIndex(fromIndex);
    if (startWordIndex >= wordsInUse)
        return;
    int endWordIndex = wordIndex(toIndex - 1);
    if (endWordIndex >= wordsInUse) {
        toIndex = length();
        endWordIndex = wordsInUse - 1;
    }
    long firstWordMask = WORD_MASK << fromIndex;
    long lastWordMask = WORD_MASK >>> -toIndex;
    if (startWordIndex == endWordIndex) {
        // Case 1: One word
        words[startWordIndex] &= ~(firstWordMask & lastWordMask);
    } else {
        // Case 2: Multiple words
        // Handle first word
        words[startWordIndex] &= ~firstWordMask;
        // Handle intermediate words, if any
        for (int i = startWordIndex+1; i < endWordIndex; i++)
            words[i] = 0;
        // Handle last word
        words[endWordIndex] &= ~lastWordMask;
    }
    recalculateWordsInUse();
    checkInvariants();
}

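The method splits the range into three parts: the start word, the intermediate words, and the stop word.
For the start word, every bit from the start position to the end of that word is set to 0; for the intermediate words, all bits of each long are set to 0; for the stop word, every bit from the beginning of the word up to the given end position (exclusive) is set to 0.
The special case is when the start word and the stop word are the same long.
As the code shows, the implementation builds two masks and applies them to the start word and the stop word.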
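The two masks are easier to follow with concrete numbers. The sketch below recomputes them outside BitSet; RangeMaskDemo and its method names are illustrative, while the two shift expressions are copied from the listing above.

```java
class RangeMaskDemo {
    static final long WORD_MASK = 0xffffffffffffffffL;

    /** 1s from bit (fromIndex mod 64) up to bit 63 of the first word. */
    static long firstWordMask(int fromIndex) {
        return WORD_MASK << fromIndex;
    }

    /** 1s for the bits of the last word below toIndex (all 64 bits if toIndex is a multiple of 64). */
    static long lastWordMask(int toIndex) {
        // -toIndex is reduced modulo 64, so this shifts right by (64 - toIndex % 64) % 64
        return WORD_MASK >>> -toIndex;
    }

    public static void main(String[] args) {
        // Range [3, 10) inside one word: bits 3..9
        long both = firstWordMask(3) & lastWordMask(10);
        System.out.println(Long.toHexString(both));   // 3f8
    }
}
```

ANDing a word with ~both clears exactly bits 3 to 9, which is the "Case 1: One word" branch.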

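4. Two important internal check functions

In the code above, the mutating functions typically end with these two calls:
recalculateWordsInUse();
checkInvariants();
These two functions maintain and check the internal state of the bitset. Their implementation makes the principle clear: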

/**
 * Sets the field wordsInUse to the logical size in words of the bit set.
 * WARNING:This method assumes that the number of words actually in use is
 * less than or equal to the current value of wordsInUse!
 */
private void recalculateWordsInUse() {
    // Traverse the bitset until a used word is found
    int i;
    for (i = wordsInUse-1; i >= 0; i--)
        if (words[i] != 0)
            break;
    wordsInUse = i+1; // The new logical size
}

wordsInUse is the number of longs in the array that are actually in use, i.e. words[wordsInUse-1] is the last long that holds a set bit. This value records the effective size of the bitset.

/**
 * Every public method must preserve these invariants.
 */
private void checkInvariants() {
    assert(wordsInUse == 0 || words[wordsInUse - 1] != 0);
    assert(wordsInUse >= 0 && wordsInUse <= words.length);
    assert(wordsInUse == words.length || words[wordsInUse] == 0);
}

checkInvariants, as the code shows, checks the internal state, in particular whether wordsInUse is valid.
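wordsInUse is private, but its effect is visible through length(), which is derived from it. A small sketch (WordsInUseDemo is an illustrative name):

```java
import java.util.BitSet;

class WordsInUseDemo {
    /** Returns {length before clear, length after clear} for a set holding bits 5 and 200. */
    static int[] lengths() {
        BitSet bs = new BitSet();
        bs.set(5);
        bs.set(200);
        int before = bs.length();   // highest set bit + 1 = 201
        bs.clear(200);              // the top word becomes 0, so wordsInUse shrinks
        int after = bs.length();    // now determined by bit 5: 6
        return new int[] { before, after };
    }
}
```

The allocated array does not shrink on clear; only the logical size does.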

5. Flipping a given bit

Flipping turns 1 into 0 and 0 into 1; it is an XOR with 1.

/**
 * Sets the bit at the specified index to the complement of its
 * current value.
 *
 * @param bitIndex the index of the bit to flip
 * @throws IndexOutOfBoundsException if the specified index is negative
 * @since 1.4
 */
public void flip(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);
    words[wordIndex] ^= (1L << bitIndex);
    recalculateWordsInUse();
    checkInvariants();
}

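Flipping also takes two basic steps: find the corresponding long, then build the mask and XOR it with the given bit.
int wordIndex = wordIndex(bitIndex);
words[wordIndex] ^= (1L << bitIndex);
Note that before the operation a function expandTo(wordIndex) is called. It makes sure the bitset has the long in question; if not, the long array is grown. The growth strategy is to allocate the larger of twice the current size and the required size.
The code is as follows: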

/**
 * Ensures that the BitSet can accommodate a given wordIndex,
 * temporarily violating the invariants. The caller must
 * restore the invariants before returning to the user,
 * possibly using recalculateWordsInUse().
 * @param wordIndex the index to be accommodated.
 */
private void expandTo(int wordIndex) {
    int wordsRequired = wordIndex+1;
    if (wordsInUse < wordsRequired) {
        ensureCapacity(wordsRequired);
        wordsInUse = wordsRequired;
    }
}

/**
 * Ensures that the BitSet can hold enough words.
 * @param wordsRequired the minimum acceptable number of words.
 */
private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        int request = Math.max(2 * words.length, wordsRequired);
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}

A range flip is provided as well; its implementation is essentially the same as that of clear. The code is as follows:

public void flip(int fromIndex, int toIndex) {
    checkRange(fromIndex, toIndex);
    if (fromIndex == toIndex)
        return;
    int startWordIndex = wordIndex(fromIndex);
    int endWordIndex = wordIndex(toIndex - 1);
    expandTo(endWordIndex);
    long firstWordMask = WORD_MASK << fromIndex;
    long lastWordMask = WORD_MASK >>> -toIndex;
    if (startWordIndex == endWordIndex) {
        // Case 1: One word
        words[startWordIndex] ^= (firstWordMask & lastWordMask);
    } else {
        // Case 2: Multiple words
        // Handle first word
        words[startWordIndex] ^= firstWordMask;
        // Handle intermediate words, if any
        for (int i = startWordIndex+1; i < endWordIndex; i++)
            words[i] ^= WORD_MASK;
        // Handle last word
        words[endWordIndex] ^= lastWordMask;
    }
    recalculateWordsInUse();
    checkInvariants();
}
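A short usage sketch of the range flip through the public API; FlipDemo and flipTwice are illustrative names. It also shows that flipping the same range twice restores the original set, since XOR with the same mask is its own inverse.

```java
import java.util.BitSet;

class FlipDemo {
    /** Flips the range [2, 5) twice and reports the set after each flip. */
    static String flipTwice() {
        BitSet bs = new BitSet();
        bs.set(3);
        bs.flip(2, 5);                     // bits 2 and 4 turn on, bit 3 turns off
        String afterFirst = bs.toString();
        bs.flip(2, 5);                     // the same mask again restores {3}
        return afterFirst + " -> " + bs;
    }

    public static void main(String[] args) {
        System.out.println(flipTwice());   // {2, 4} -> {3}
    }
}
```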

6. Setting a given bit (OR operation)

/**
 * Sets the bit at the specified index to {@code true}.
 *
 * @param bitIndex a bit index
 * @throws IndexOutOfBoundsException if the specified index is negative
 * @since JDK1.0
 */
public void set(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);
    words[wordIndex] |= (1L << bitIndex); // Restores invariants
    checkInvariants();
}

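The idea is the same as flip, except that the operation is an OR with 1.
The JDK also provides an operation that sets a bit explicitly to 0 or 1, as well as operations on a range.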

public void set(int bitIndex, boolean value) {
    if (value)
        set(bitIndex);
    else
        clear(bitIndex);
}

7. Getting the state of a given bit

/**
 * Returns the value of the bit with the specified index. The value
 * is {@code true} if the bit with the index {@code bitIndex}
 * is currently set in this {@code BitSet}; otherwise, the result
 * is {@code false}.
 *
 * @param bitIndex the bit index
 * @return the value of the bit with the specified index
 * @throws IndexOutOfBoundsException if the specified index is negative
 */
public boolean get(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
    checkInvariants();
    int wordIndex = wordIndex(bitIndex);
    return (wordIndex < wordsInUse)
        && ((words[wordIndex] & (1L << bitIndex)) != 0);
}

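The same two steps again; here the bit operation is &. As the code shows, if the requested bit lies beyond the words in use, the result is false, i.e. not set.
The JDK also provides a method that returns the bits of a given range. Its return value is itself a BitSet that contains only the requested bits. Note that its size is just large enough to hold the necessary bits (rounded up to a whole number of longs, of course). The code is as follows: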

public BitSet get(int fromIndex, int toIndex) {
    checkRange(fromIndex, toIndex);
    checkInvariants();
    int len = length();
    // If no set bits in range return empty bitset
    if (len <= fromIndex || fromIndex == toIndex)
        return new BitSet(0);
    // An optimization
    if (toIndex > len)
        toIndex = len;
    BitSet result = new BitSet(toIndex - fromIndex);
    int targetWords = wordIndex(toIndex - fromIndex - 1) + 1;
    int sourceIndex = wordIndex(fromIndex);
    boolean wordAligned = ((fromIndex & BIT_INDEX_MASK) == 0);
    // Process all words but the last word
    for (int i = 0; i < targetWords - 1; i++, sourceIndex++)
        result.words[i] = wordAligned ? words[sourceIndex] :
            (words[sourceIndex] >>> fromIndex) |
            (words[sourceIndex+1] << -fromIndex);
    // Process the last word
    long lastWordMask = WORD_MASK >>> -toIndex;
    result.words[targetWords - 1] =
        ((toIndex-1) & BIT_INDEX_MASK) < (fromIndex & BIT_INDEX_MASK)
        ? /* straddles source words */
        ((words[sourceIndex] >>> fromIndex) |
         (words[sourceIndex+1] & lastWordMask) << -fromIndex)
        :
        ((words[sourceIndex] & lastWordMask) >>> fromIndex);
    // Set wordsInUse correctly
    result.wordsInUse = targetWords;
    result.recalculateWordsInUse();
    result.checkInvariants();
    return result;
}

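There is one tricky point here: the bit at fromIndex is stored at position 0 of the returned bitset, and so on. If fromIndex is not word-aligned, the first word of the returned bitset combines the bits of fromIndex's word from fromIndex upward with the lowest bits of the following word (one word's worth of bits in total).
Here >>> is the unsigned (logical) right-shift operator.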
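The re-indexing is easiest to see on a small case that straddles two source words. GetRangeDemo and slice are illustrative names; the calls are the public java.util.BitSet API.

```java
import java.util.BitSet;

class GetRangeDemo {
    /** Extracts [3, 71) from a set holding bits 3, 5 and 70. */
    static BitSet slice() {
        BitSet bs = new BitSet();
        bs.set(3);
        bs.set(5);
        bs.set(70);             // lives in the second long
        return bs.get(3, 71);   // every index i in the range maps to i - 3
    }

    public static void main(String[] args) {
        System.out.println(slice());   // {0, 2, 67}
    }
}
```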

8. Getting the current logical size of the bitset in bits

/**
 * Returns the "logical size" of this {@code BitSet}: the index of
 * the highest set bit in the {@code BitSet} plus one. Returns zero
 * if the {@code BitSet} contains no set bits.
 *
 * @return the logical size of this {@code BitSet}
 * @since 1.2
 */
public int length() {
    if (wordsInUse == 0)
        return 0;
    return BITS_PER_WORD * (wordsInUse - 1) +
        (BITS_PER_WORD - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
}
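The formula can be replayed on raw words to see where the number comes from. This is a sketch under the assumption that the words are in the layout returned by BitSet.toLongArray(); LengthDemo is an illustrative name.

```java
class LengthDemo {
    /** Recomputes BitSet.length() from raw words, following the formula in the listing. */
    static int length(long[] words) {
        int wordsInUse = words.length;
        while (wordsInUse > 0 && words[wordsInUse - 1] == 0)
            wordsInUse--;                       // skip trailing all-zero words
        if (wordsInUse == 0)
            return 0;
        // full words below the top one, plus the position of the top word's highest 1 bit
        return 64 * (wordsInUse - 1)
             + (64 - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
    }
}
```

For the single word 0b101000 (bits 3 and 5 set), numberOfLeadingZeros returns 58, so the length is 64 - 58 = 6.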

9. hashCode

The hash code is an important property that summarizes the content of a data structure. BitSet implements hashCode as follows:

/**
 * Returns the hash code value for this bit set. The hash code depends
 * only on which bits are set within this {@code BitSet}.
 * Note that the hash code changes if the set of bits is altered.
 *
 * @return the hash code value for this bit set
 */
public int hashCode() {
    long h = 1234;
    for (int i = wordsInUse; --i >= 0; )
        h ^= words[i] * (i + 1);
    return (int)((h >> 32) ^ h);
}

This hash code takes into account both every word and each word's position. When the state of a bit changes, the hash code generally changes with it.
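The same calculation can be reproduced outside the class from toLongArray(), which returns only the words in use. HashDemo is an illustrative name; the loop body is the one from the listing.

```java
import java.util.BitSet;

class HashDemo {
    /** Applies the BitSet hash formula to an array of words. */
    static int hash(long[] words) {
        long h = 1234;
        for (int i = words.length; --i >= 0; )
            h ^= words[i] * (i + 1);        // weight each word by its position
        return (int) ((h >> 32) ^ h);       // fold 64 bits into 32
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(1);
        bs.set(64);
        bs.set(200);
        System.out.println(hash(bs.toLongArray()) == bs.hashCode());   // true
    }
}
```

Because only the words in use take part, two sets with the same bits but different allocated capacities hash to the same value.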

III. Using BitSet

Using BitSet is straightforward: just call the method that corresponds to the operation you need.
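As one usage sketch, the interview-style task of sorting distinct non-negative integers reduces to setting bits and then walking them in order. BitSetSort and sortDistinct are illustrative names.

```java
import java.util.BitSet;

class BitSetSort {
    /** Returns the distinct non-negative values of input in ascending order. */
    static int[] sortDistinct(int[] input) {
        BitSet bits = new BitSet();
        for (int v : input) bits.set(v);             // duplicates collapse into one bit
        int[] out = new int[bits.cardinality()];     // number of set bits
        int n = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1))
            out[n++] = i;
        return out;
    }
}
```

For the input {9, 2, 7, 2, 0} this returns {0, 2, 7, 9}. The cost is linear in the number of inputs plus the size of the value range.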

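Trie

The trie is a data structure widely used in information retrieval, string matching, and related fields.

It is also the structural basis of suffix trees and Aho-Corasick automata.

Implementation code for trie lookup, insertion, and other operations:

Dong's Blog, "Data structures: the trie"

http://dongxicheng.org/structure/trietree/

Also known as a dictionary tree, word-lookup tree, or prefix tree, it is a multi-way tree structure for fast retrieval.

A trie can use the common prefixes of strings to save storage space.

Basic property:

The root node contains no character; every node other than the root contains exactly one character.

July's blog (结构之法 算法之道), "From the trie (dictionary tree) to the suffix tree (revised 10.28)"

http://blog.csdn.net/v_july_v/article/details/6897097

A typical application is counting and sorting large numbers of strings (though not limited to strings), so tries are often used by search engines for word-frequency statistics. Their advantage is that they minimize needless string comparisons; the referenced article states that lookups are more efficient than with a hash table.

The core idea of the trie is to trade space for time: the common prefixes of strings are used to cut query time and so improve efficiency.

For a word, just follow its characters down from the root to the corresponding node and check whether that node is marked red to know whether the word has appeared. Marking the node red is equivalent to inserting the word. This way lookup and insertion can be done in one pass.

Implement it with dynamically linked nodes, or simulate dynamic allocation with arrays.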
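A minimal linked-node trie along these lines is sketched below. It is not the code from either referenced article: the class name Trie, the HashMap-based child table, and the use of a counter in place of the "red mark" (so that word frequencies can be read off) are all choices made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;   // times a word ending here was inserted; count > 0 is the "red mark"
    }

    private final Node root = new Node();   // the root holds no character

    /** Inserts a word and returns how many times it has now been inserted. */
    int insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        return ++node.count;
    }

    /** Returns how many times the word was inserted, or 0 if it never was. */
    int count(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;      // the path breaks off: word absent
        }
        return node.count;
    }
}
```

Since insert returns the updated count, a caller learns in the same walk whether the word had appeared before, which matches the "lookup and insertion in one pass" remark.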

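The memset function

Using memset

memset prototype (type "man memset" in your shell):

void *memset(void *s, int c, size_t n);

memset fills a block of memory with a given value. It is one of the fastest ways to zero a large struct or array.

Three common mistakes

First: swapping c and n. To zero a char a[20], always write memset(a, 0, 20), not memset(a, 20, 0).

Second: overusing memset. Some programmers seem to fear uninitialized memory, so they write code like this:

char buffer[20];
memset(buffer, 0, sizeof(char) * 20);
strcpy(buffer, "123");

The memset here is redundant: the memory is overwritten immediately, so zeroing it is pointless.

Third: strictly speaking this is not a misuse of memset itself, but it often appears alongside it:

int some_func(struct something *a) {
    ...
    ...
    memset(a, 0, sizeof(a));
    ...
}

Here a is a pointer, so sizeof(a) is the size of the pointer, not of the struct; sizeof(*a) is what was intended.

Q: Why zero with memset at all? memset(&Address, 0, sizeof(Address)); is seen very often, but even without it, isn't the remaining space zeroed when the data is allocated?

A: 1. If it is not cleared, stray values may show up in testing. Try the following experiment and look at the result:

char buf[5];
CString str, str1;
// memset(buf, 0, sizeof(buf));
for (int i = 0; i < 5; i++) {
    str.Format("%d ", buf[i]);
    str1 += str;
}
TRACE("%s\r\n", str1);

2. Not so! Especially for character buffers, the remainder is usually not 0. Try it: define a character array, copy a string into it without zeroing it with memset first, and display it with MessageBox; garbage may appear (0 means NUL; if present, it terminates the string and the garbage after it is not printed).

Q: The following demo works and sets every element of the array to the character '1':

#include <iostream>
#include <cstring>
#include <cstdlib>
using namespace std;
int main()
{
    char a[5];
    memset(a, '1', 5);
    for (int i = 0; i < 5; i++)
        cout << a[i] << " ";
    system("pause");
    return 0;
}

But the following program, which tries to set every element to 1, does not work:

#include <iostream>
#include <cstring>
#include <cstdlib>
using namespace std;
int main()
{
    int a[5];
    memset(a, 1, 5); // changing this to memset(a, 1, 5 * sizeof(int)) does not work either
    for (int i = 0; i < 5; i++)
        cout << a[i] << " ";
    system("pause");
    return 0;
}

The questions are:
1. Why does the first program work while the second does not?
2. Can int a[5] be initialized without a for or while loop? (Is there a function like memset() for this?)

A: 1. In the first program a is a char array; a char occupies 1 byte and memset assigns byte by byte, so the output is as expected. In the second program a is an int array, but memset still assigns byte by byte, so after a full fill each element is actually 0x01010101, i.e. 16843009 in decimal. Check whether your output matches.

2. memset(a, 1, 20); assigns the 20 bytes that a points to, filling each with the byte value 1, which is 00000001 in binary. An int element is 4 bytes on typical platforms, so together the bytes form 00000001000000010000000100000001, i.e. 16843009, and that is the value each int element ends up with.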

Program example

#include <string.h>
#include <stdio.h>
#include <memory.h>

int main(void)
{
    char buffer[] = "Hello world\n";
    printf("Buffer before memset: %s\n", buffer);
    memset(buffer, '*', strlen(buffer));
    printf("Buffer after memset: %s\n", buffer);
    return 0;
}

Output:

Buffer before memset: Hello world
Buffer after memset: ************

Compiled with: Microsoft Visual C++ 6.0

The target does not have to be a char buffer. The value argument is declared as int (it is converted to a single byte), and the memory being filled may hold int or other types. For example:

int array[5] = {1, 4, 3, 5, 2};
for (int i = 0; i < 5; i++)
    cout << array[i] << " ";
cout << endl;
memset(array, 0, 5 * sizeof(int));
for (int k = 0; k < 5; k++)
    cout << array[k] << " ";
cout << endl;

Output:

1 4 3 5 2
0 0 0 0 0

The size argument is in bytes, so for int or other types the element count must be multiplied by the element size rather than by the default 1 (char). Since the size of int may differ between machines, it is best to use sizeof().

Note that memset operates on bytes, so if the program above is changed to

int array[5] = {1, 4, 3, 5, 2};
for (int i = 0; i < 5; i++)
    cout << array[i] << " ";
cout << endl;
memset(array, 1, 5 * sizeof(int)); // note the difference from the program above
for (int k = 0; k < 5; k++)
    cout << array[k] << " ";
cout << endl;

the output is:

1 4 3 5 2
16843009 16843009 16843009 16843009 16843009

Why? Because memset works byte by byte: each of the 4 bytes of an int element is filled with the value 1, which is 00000001 in binary. Put together, the 4 bytes are 00000001000000010000000100000001, which equals 16843009, and that becomes the value of the int element.

So using memset to give a non-char array an initial value other than 0 is not advisable!

For example, a struct Some x can be zeroed like this:

memset(&x, 0, sizeof(Some));

For an array of structs Some x[10]:

memset(x, 0, sizeof(Some) * 10);
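The 16843009 result does not depend on C. As a rough Java analogue (Java has no memset; Arrays.fill on a byte[] plays that role here, and ByteFillDemo is an illustrative name), filling four bytes with the same value and reading them back as one int gives the same number:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class ByteFillDemo {
    /** Fills 4 bytes with b, then reads them back as a single int. */
    static int fillAndRead(byte b) {
        byte[] raw = new byte[4];
        Arrays.fill(raw, b);                     // byte-wise fill, like memset(raw, b, 4)
        return ByteBuffer.wrap(raw).getInt();    // byte order is irrelevant: all bytes are equal
    }

    public static void main(String[] args) {
        System.out.println(fillAndRead((byte) 1));   // 16843009, i.e. 0x01010101
        System.out.println(fillAndRead((byte) 0));   // 0
    }
}
```

A byte-wise fill can only produce int values whose four bytes are all equal, which is why 0 works as expected and 1 does not.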

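Detailed notes on memset

1. void *memset(void *s, int c, size_t n)

Overall effect: sets the first n bytes of the already allocated memory s to the value c.

2. Example

main() {
    char *s = "Golden Global View";
    clrscr();
    memset(s, 'G', 6); // there seems to be a problem here: single-stepping to this line reports a memory access violation
    printf("%s", s);
    getchar();
    return 0;
}

[The access violation in the example above is most likely because s points to a string literal that is placed in read-only program storage. Changing it to char s[] = "Golden Global View"; removes the problem.]

[Another reader commented that a string pointer works as well and that the memory is not read-only. Whether it happens to run depends on the compiler and platform: modifying a string literal is undefined behavior in C, so the array form is the safe one.]

3. memset() is commonly used to initialize memory. For example:

char str[100];
memset(str, 0, 100);

4. The deeper point of memset(): it sets a whole block of memory to one character, and is typically used to initialize a declared string buffer, e.g. memset(a, '\0', sizeof(a));

memcpy copies memory; it can copy objects of any data type, and the length to copy can be specified. Example:

char a[100], b[50];
memcpy(b, a, sizeof(b)); // note: using sizeof(a) would overflow b

strcpy can only copy strings; it stops copying at '\0'. Example:

char a[100], b[50];
strcpy(a, b);

With strcpy(b, a), check whether the string in a (up to the first '\0') fits into the 50 bytes of b including the terminator; if it does not, b overflows.

5. Addendum: one person's notes

memset makes it easy to clear a variable or array of a struct type. For example:

struct sample_struct
{
    char csName[16];
    int iSeq;
    int iType;
};

For the variable

struct sample_struct stTest;

the usual way to clear stTest is:

stTest.csName[0] = '\0';
stTest.iSeq = 0;
stTest.iType = 0;

With memset it is much more convenient:

memset(&stTest, 0, sizeof(struct sample_struct));

For an array:

struct sample_struct TEST[10];

use

memset(TEST, 0, sizeof(struct sample_struct) * 10);

Also: arrays embedded in the struct are zeroed along with it, but if the struct holds pointers to separately allocated arrays, those arrays still need to be initialized separately.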

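Massive data processing algorithm problems

Given a 200 MB text file that maps IP addresses to real-world locations, for example: 211.200.101.100 北京 (Beijing). You are then given 600 million IP addresses; design an algorithm that quickly prints the corresponding locations.

Solution:

1. Use a hashmap: read the 200 MB file into memory, go through every line, and build a hashmap with each IP as key and its location as value.

2. Go through the 600 million IPs and look each one up in the hashmap to get the value for that key.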
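The two steps can be sketched as follows. IpLookup, load and locate are hypothetical names, the mapping lines are passed in as a list rather than read from the 200 MB file, and each line is assumed to be "IP, whitespace, location".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IpLookup {
    /** Step 1: builds the IP -> location table from lines such as "211.200.101.100 Beijing". */
    static Map<String, String> load(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);   // IP, then the rest of the line
            if (parts.length == 2) table.put(parts[0], parts[1]);
        }
        return table;
    }

    /** Step 2: one O(1) expected-time lookup per queried IP. */
    static String locate(Map<String, String> table, String ip) {
        return table.getOrDefault(ip, "unknown");
    }
}
```

The 600 million queries never need to be held in memory together; they can be streamed and answered one at a time against the table.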

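Find the non-repeated integers among 250 million integers (those that appear exactly once). The code below uses a 2-bit bitmap: each value gets two bits that record 0 = not seen, 1 = seen once, 2 = seen more than once, so covering all 2^32 values takes 2^32 * 2 bits = 1 GB.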

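#include <cstdio>
#include <iostream>
#include <vector>
using namespace std;

// Each byte holds four 2-bit states: 0 = unseen, 1 = seen once, 2 = seen more than once.
// Returns the state stored in slot num (0..3) of byte c.
int getval(const unsigned char& c, int num) {
    return (c >> (2 * num)) & 0x3;
}

// Stores state val (1 or 2) in slot num (0..3) of byte c.
void setval(unsigned char& c, int num, int val) {
    c = (c & ~(0x3 << (2 * num))) | (val << (2 * num));
}

// Records one occurrence of num, saturating at state 2.
void setbit(vector<unsigned char>& a, unsigned int num) {
    unsigned char& cell = a[num / 4];        // the byte that holds num's two bits
    int state = getval(cell, num % 4);       // the slot must be num % 4, not num
    if (state == 0) {
        setval(cell, num % 4, 1);
    } else if (state == 1) {
        setval(cell, num % 4, 2);
    }
}

int main(int argc, char** argv) {
    // 1 GB on the heap; a local array of this size would overflow the stack.
    vector<unsigned char> a(1024UL * 1024 * 1024, 0);
    FILE* file = fopen("in.txt", "r");
    if (file == NULL)
        return 1;
    unsigned int r;
    // Read up to 250 million unsigned integers.
    for (unsigned int i = 0; i < 250000000 && fscanf(file, "%u", &r) == 1; ++i)
        setbit(a, r);
    fclose(file);

    // Count the values whose state is exactly 1 (seen once).
    unsigned count = 0;
    for (size_t i = 0; i < a.size(); ++i)
        for (int j = 0; j < 4; ++j)
            if (getval(a[i], j) == 1)
                count++;
    cout << count << endl;
    return 0;
}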

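Bitmap: the bitmap data structure for massive data processing

Dong's Blog: http://dongxicheng.org/structure/bitmap/

"Massive data processing algorithms: Bit-Map": http://blog.csdn.net/hguisu/article/details/7880288

Widely used in indexing and data compression.

Programming Pearls

Here i>>SHIFT is equivalent to i/32 and gives the array index;
i&MASK is equivalent to i mod 32 and gives the bit to set;
1<<(i&MASK) yields 2 to the power (i&MASK).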

#include <stdio.h>

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000

int a[1 + N/BITSPERWORD]; /* bit vector representing 10 million integers */

/* Set the bit for integer i. */
void set(int i) {
    /* Each element of a represents 32 integers, so the bit for i lives in
       a[i/32], at position i & MASK (the integer formed by the low five bits of i). */
    a[i>>SHIFT] |= (1<<(i & MASK)); /* OR with a 1 at bit i & MASK */
}

/* Clear the bit for integer i. */
void clr(int i) {
    a[i>>SHIFT] &= ~(1<<(i & MASK));
}

/* Test whether integer i is in the bit vector. */
int test(int i) {
    return a[i>>SHIFT] & (1<<(i & MASK));
}

int main(void) {
    int i;
    for (i = 0; i < N; ++i)
        clr(i);
    while (scanf("%d", &i) == 1) /* read the integers to sort */
        set(i);
    for (i = 0; i < N; ++i)
        if (test(i))
            printf("%d\n", i);
    return 0;
}