Converting Buffer Chunks into Strings

It's common in Node applications to read multiple Buffer objects, convert them to strings, and concatenate the results. Using a Socket as an example, the pattern looks like this:

var msg = '';

connection.on('data', function(chunk) {  
  msg += chunk.toString();
});

connection.on('end', function() {  
  console.log(msg);
});

The Problem

A common mistake with this pattern is not accounting for multi-byte character encodings such as UTF-8. A single UTF-8 character can consist of several bytes. For example, my favorite character, the snowman (☃), is represented in three bytes ([0xe2, 0x98, 0x83]). You can verify this using console.log(Buffer.from([0xe2, 0x98, 0x83]).toString());

The problem is that it's possible for a 'data' event to be triggered before all three bytes have been received. In that case, chunk.toString() in the original code is handed an incomplete byte sequence and decodes it to the Unicode replacement character (U+FFFD) instead of a snowman. A client can artificially reproduce this using the following code:

var chunk1 = Buffer.from([0xe2]);        // first byte of the snowman
var chunk2 = Buffer.from([0x98, 0x83]);  // remaining two bytes

socket.write(chunk1, function() {  
  setTimeout(function() {
    socket.write(chunk2, function() {
      socket.end();
    });
  }, 5000);
});
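
To see what the server's chunk.toString() call does with those two writes, the following minimal sketch decodes the same two chunks separately, with no socket involved. The variable names are illustrative, and it assumes a Node version that provides Buffer.from().

```javascript
// Same bytes as the two writes above, decoded one chunk at a time.
const chunk1 = Buffer.from([0xe2]);
const chunk2 = Buffer.from([0x98, 0x83]);

// Neither chunk holds a complete UTF-8 sequence on its own, so the
// decoder substitutes the Unicode replacement character (U+FFFD).
const broken = chunk1.toString() + chunk2.toString();

console.log(broken === '☃');            // false
console.log(broken.includes('\ufffd')); // true
```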

The Solution

The solution to this problem is a bit anticlimactic. Before handling any 'data' events, simply call setEncoding() on the source. In the original example, it would look like:

var msg = '';

connection.setEncoding('utf8');

connection.on('data', function(chunk) {  
  msg += chunk;
});

connection.on('end', function() {  
  console.log(msg);
});
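
Under the hood, Node's readable streams handle setEncoding() with the core string_decoder module, which holds back the trailing bytes of an incomplete character until the rest arrive. As a rough sketch of that behavior outside of a stream (the variable names are illustrative):

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');

// The lone first byte is held back instead of being decoded.
const first = decoder.write(Buffer.from([0xe2]));
// Once the remaining two bytes arrive, the full character is returned.
const second = decoder.write(Buffer.from([0x98, 0x83]));
// end() flushes any bytes still buffered; nothing is left here.
const rest = decoder.end();

console.log(JSON.stringify(first)); // ""
console.log(second);                // ☃
console.log(JSON.stringify(rest));  // ""
```

Using StringDecoder directly can be handy when the chunks come from something other than a stream that offers setEncoding().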

Another solution is to store all of the chunks in an array, and concatenate them into a single Buffer once all of the data has been received. This pattern looks like:

var chunks = [];

connection.on('data', function(chunk) {  
  chunks.push(chunk);
});

connection.on('end', function() {  
  var msg = Buffer.concat(chunks).toString();

  console.log(msg);
});
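
This works because Buffer.concat() rejoins the raw bytes before any decoding takes place, so it does not matter where the chunk boundaries fell. A small sketch, using hand-built chunks in place of real 'data' events:

```javascript
// Two snowmen (6 bytes) split at awkward points, the way 'data'
// events might happen to deliver them.
const chunks = [
  Buffer.from([0xe2, 0x98]),
  Buffer.from([0x83, 0xe2]),
  Buffer.from([0x98, 0x83])
];

// Decode once, after the bytes are whole again.
const msg = Buffer.concat(chunks).toString();

console.log(msg);        // ☃☃
console.log(msg.length); // 2
```

The trade-off is that every chunk stays in memory until 'end' fires, and no text is available before then.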

An Example

Here is a complete example that sends a bunch of snowman characters from a client to a server. The string is validated when it is created on the client, and then sent to the server. Upon receiving a connection, the server reads in all of the data, with no regard to the character encoding. Each chunk is concatenated to a string variable, as well as stored in an array of Buffers. Once all of the data has been received, the concatenated string is validated. The array of Buffers is then converted into a string and validated.

'use strict';  
var assert = require('assert');  
var net = require('net');  
var snowman = '☃';  
var count = 100000;

function validate(str) {  
  try {
    assert.strictEqual(str.length, count);
    assert.strictEqual(Buffer.byteLength(str), count * 3);

    for (var i = 0; i < count; ++i) {
      assert.strictEqual(str.charAt(i), snowman);
    }
  } catch (err) {
    console.log('Validation failed!');
    console.log(err.stack);
    return;
  }

  console.log('Validation passed!');
}

var server = net.createServer(function(sock) {  
  var str = '';
  var chunks = [];

  sock.on('data', function(data) {
    str += data.toString();
    chunks.push(data);
  });

  sock.on('end', function() {
    validate(str);
    validate(Buffer.concat(chunks).toString());
    server.close();
  });
});

server.listen(function() {  
  var socket = net.createConnection({port: server.address().port}, function() {
    var str = '';

    for (var i = 0; i < count; ++i) {
      str += snowman;
    }

    validate(str);
    socket.write(str, function() {
      socket.end();
    });
  });
});

Running this application yields the following results (your mileage may vary based on operating system and version of Node):

Validation passed!  
Validation failed!  
AssertionError: 100006 === 100000  
    at validate (/private/tmp/buf2str.js:9:12)
    at Socket.<anonymous> (/private/tmp/buf2str.js:34:5)
    at emitNone (events.js:72:20)
    at Socket.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:889:12)
    at doNTCallback2 (node.js:429:9)
    at process._tickCallback (node.js:343:17)
Validation passed!  

The first Validation passed! message indicates that the string created on the client is correct. The Validation failed! message shows that converting each chunk to a string on its own, without regard for the character encoding, corrupted the message: the assertion reports a length of 100006 instead of the expected 100000. The final Validation passed! message shows that creating one large Buffer before converting to a string works fine.
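
The length mismatch can be reproduced without a network. The sketch below splits a two-snowman Buffer at every possible offset, decodes the two halves separately as the server does, and checks the resulting string length. The helper name lengthWhenSplitAt is made up for this illustration. The exact number of replacement characters produced for a broken sequence may differ between Node versions, so only the comparison against the expected length is reported.

```javascript
const bytes = Buffer.from('☃☃'); // 6 bytes, 2 characters

// Decode the two halves independently, then join the strings.
function lengthWhenSplitAt(offset) {
  const head = bytes.subarray(0, offset).toString();
  const tail = bytes.subarray(offset).toString();
  return (head + tail).length;
}

for (let offset = 0; offset <= bytes.length; ++offset) {
  const intact = lengthWhenSplitAt(offset) === 2;
  console.log(offset, intact ? 'intact' : 'corrupted');
}
```

Only offsets 0, 3, and 6 fall on character boundaries, so only those splits leave the string intact. Every other split lengthens the string, which is the kind of drift the failed assertion reports.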

Again, it's worth noting that string concatenation is fine if you call setEncoding() before processing any 'data' events.

Conclusion

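This post illustrated the importance of character encoding when moving between Buffers and strings. It showed how a multi-byte character split across chunks gets corrupted, and covered two techniques for avoiding the problem: calling setEncoding() on the source, and concatenating the Buffers before converting them to a string.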