JavaScript's DataView offers more flexibility than a TypedArray for working with binary data held in an ArrayBuffer. It provides getter and setter methods for reading and writing values of different numeric types at arbitrary byte offsets in the buffer. Reading a number is easy with getters such as getUint32() or getFloat64(), but there is no getString() method, and the problem becomes more complex when the text uses a multi-byte encoding such as UTF-8 or UTF-16. UTF-8 is a variable-length encoding: each character takes a minimum of 1 byte and a maximum of 4 bytes.
In this article we will write a getString() function and add it to the DataView prototype so that it can be used in the same way as getUint32(). First, look at the binary format of a UTF-8 byte sequence.
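As a quick illustration of the numeric accessors, here is a minimal sketch. The buffer layout (a 32-bit unsigned integer followed by a 64-bit float) is made up for this example and is not taken from any real protocol.

```javascript
// Hypothetical layout: bytes 0-3 hold a Uint32, bytes 4-11 hold a Float64.
const buffer = new ArrayBuffer(12);
const view = new DataView(buffer);

// DataView reads and writes big-endian unless `true` is passed as the last argument.
view.setUint32(0, 0xCAFEBABE);
view.setFloat64(4, 3.5);

const id = view.getUint32(0);
const value = view.getFloat64(4);

console.log(id.toString(16), value); // prints "cafebabe 3.5"
```

Each getter names the type it reads, so the caller only has to supply the byte offset. A string getter needs one more piece of information, the number of bytes to decode.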
Binary format of bytes in sequence
| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
|---|---|---|---|---|---|
| 0xxxxxxx | | | | 7 | 007F hex (127) |
| 110xxxxx | 10xxxxxx | | | (5+6)=11 | 07FF hex (2047) |
| 1110xxxx | 10xxxxxx | 10xxxxxx | | (4+6+6)=16 | FFFF hex (65535) |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21 | 10FFFF hex (1,114,111) |
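To make the table concrete, the sketch below spreads the bits of a code point over one to four bytes, following the patterns above. The function name encodeCodePoint is a hypothetical helper introduced for illustration; it assumes a valid code point and performs no range or surrogate checks.

```javascript
// Encode one Unicode code point as an array of UTF-8 bytes.
// Hypothetical helper: assumes 0 <= cp <= 0x10FFFF and does not validate.
function encodeCodePoint(cp) {
  if (cp <= 0x7F) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6),                      // 110xxxxx
            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                     // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),             // 10xxxxxx
            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  }
  return [0xF0 | (cp >> 18),                       // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),              // 10xxxxxx
          0x80 | ((cp >> 6) & 0x3F),               // 10xxxxxx
          0x80 | (cp & 0x3F)];                     // 10xxxxxx
}

// Format a byte array as upper-case hex for display.
function toHex(bytes) {
  return bytes.map(function (b) { return b.toString(16).toUpperCase(); }).join(' ');
}

console.log(toHex(encodeCodePoint(0x41)));    // "A"       -> 41
console.log(toHex(encodeCodePoint(0xE9)));    // U+00E9    -> C3 A9
console.log(toHex(encodeCodePoint(0x20AC)));  // U+20AC    -> E2 82 AC
console.log(toHex(encodeCodePoint(0x1F600))); // U+1F600   -> F0 9F 98 80
```

Decoding is the reverse: mask off the marker bits of each byte and shift the remaining free bits back into place.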
The value of each individual byte indicates its UTF-8 function, as follows:
- 00 to 7F hex (0 to 127): first and only byte of a sequence.
- 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
- C2 to DF hex (194 to 223): first byte of a two-byte sequence.
- E0 to EF hex (224 to 239): first byte of a three-byte sequence.
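- F0 to F4 hex (240 to 244): first byte of a four-byte sequence. F5 to FF hex (245 to 255) do not occur in valid UTF-8, because the highest code point is 10FFFF.
For more detail on the UTF-8 format, click here.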
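The byte ranges above can be turned into a small classifier. This is a sketch only: sequenceLength is a hypothetical helper, and the return codes 0 and -1 are conventions chosen for this example. It additionally rejects C0, C1 and F5 to FF, which cannot occur in well-formed UTF-8.

```javascript
// Return the length of the sequence that `byte` starts,
// 0 if it is a continuation byte, or -1 if it never appears in valid UTF-8.
function sequenceLength(byte) {
  if (byte <= 0x7F) return 1;                  // 00-7F: single byte
  if (byte <= 0xBF) return 0;                  // 80-BF: continuation byte
  if (byte <= 0xC1) return -1;                 // C0-C1: would be an overlong encoding
  if (byte <= 0xDF) return 2;                  // C2-DF: two-byte lead
  if (byte <= 0xEF) return 3;                  // E0-EF: three-byte lead
  if (byte <= 0xF4) return 4;                  // F0-F4: four-byte lead
  return -1;                                   // F5-FF: beyond U+10FFFF
}

console.log(sequenceLength(0x41)); // 1
console.log(sequenceLength(0xE2)); // 3
console.log(sequenceLength(0x82)); // 0
```

A decoder can call such a function on the first byte to know how many continuation bytes to read before assembling the code point.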
I am not going to go through it line by line. Here is the final code of getString().
You can download the source code from GitHub.
// Decode `length` bytes of UTF-8, starting at byte `offset`, into a string.
// No validation is done: the input is assumed to be well-formed UTF-8.
DataView.prototype.getString = function (offset, length) {
  offset = offset || 0;
  // Default to the rest of the view (also applies when length is 0).
  length = length || (this.byteLength - offset);
  var end = offset + length, codeUnits = [], result = '', byte, codePoint;
  while (offset < end) {
    byte = this.getUint8(offset++);
    if (byte <= 0x7F) {
      // 0xxxxxxx: single byte
      codePoint = byte;
    } else if (byte <= 0xDF) {
      // 110xxxxx 10xxxxxx (stray continuation bytes also land here)
      codePoint = ((byte & 0x1F) << 6) | (this.getUint8(offset++) & 0x3F);
    } else if (byte <= 0xEF) {
      // 1110xxxx 10xxxxxx 10xxxxxx
      codePoint = ((byte & 0x0F) << 12) |
        ((this.getUint8(offset++) & 0x3F) << 6) |
        (this.getUint8(offset++) & 0x3F);
    } else {
      // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      codePoint = ((byte & 0x07) << 18) |
        ((this.getUint8(offset++) & 0x3F) << 12) |
        ((this.getUint8(offset++) & 0x3F) << 6) |
        (this.getUint8(offset++) & 0x3F);
    }
    if (codePoint > 0xFFFF) {
      // Above the BMP: split into a UTF-16 surrogate pair.
      codePoint -= 0x10000;
      codeUnits.push(0xD800 + (codePoint >> 10), 0xDC00 + (codePoint & 0x3FF));
    } else {
      codeUnits.push(codePoint);
    }
    if (codeUnits.length >= 0x2000) {
      // Flush in chunks so apply() never receives a huge argument list.
      result += String.fromCharCode.apply(null, codeUnits);
      codeUnits = [];
    }
  }
  return result + String.fromCharCode.apply(null, codeUnits);
};
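The four-byte branch is the least obvious step, because JavaScript strings store UTF-16 code units and a code point above FFFF has to be split into a surrogate pair. The sketch below isolates that arithmetic; toSurrogatePair is a hypothetical helper name used only for this illustration.

```javascript
// Split a code point in the range 0x10000..0x10FFFF into UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;          // leaves a 20-bit value
  var high = 0xD800 + (offset >> 10);        // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);       // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x1F600);
console.log(pair[0].toString(16), pair[1].toString(16)); // prints "d83d de00"

// The two code units together form a single character in the string.
var emoji = String.fromCharCode(pair[0], pair[1]);
console.log(emoji.codePointAt(0).toString(16)); // prints "1f600"
```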
Now we can read a UTF-8 encoded string as shown below. I am assuming a buffer is already available, for example from a WebSocket or an Ajax response.
var dataview = new DataView(buffer);
dataview.getString(0, 100); // decode 100 bytes starting at offset 0
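In environments that provide the standard TextEncoder and TextDecoder APIs (current browsers and Node.js), the same bytes can also be decoded without a hand-written loop, which is a convenient way to cross-check a custom decoder. The sketch below is an alternative, not the method used in this article; decodeUtf8 is a hypothetical helper and the sample text is made up.

```javascript
// Decode `length` bytes of a DataView, starting at `offset`, with TextDecoder.
function decodeUtf8(view, offset, length) {
  // Wrap the same memory in a Uint8Array; no bytes are copied here.
  var bytes = new Uint8Array(view.buffer, view.byteOffset + offset, length);
  return new TextDecoder('utf-8').decode(bytes);
}

// Build a sample buffer: 1-, 2- and 3-byte characters mixed together.
var encoded = new TextEncoder().encode('héllo €');
var sample = new DataView(encoded.buffer, encoded.byteOffset, encoded.byteLength);

console.log(sample.byteLength);                        // prints 10
console.log(decodeUtf8(sample, 0, sample.byteLength)); // prints "héllo €"
```

Note that the length is counted in bytes, not characters: the seven characters of the sample occupy ten bytes.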