C++ for JavaScript Developers

03 Jul 2020

Throughout my last few years of college, I coded primarily in JavaScript. When I arrived at my first full-time job, I discovered I would be learning C++. Having learned new languages before, picking up the syntax wasn't particularly difficult. However, I discovered there are entire classes of problems one can encounter in C++ but not JavaScript.

In this post, I go over C++ behavior I found surprising. It's what I wish I could have read before starting to write C++ code. While I name JavaScript in the title, I expect it could be substituted for many higher-level languages. I use TypeScript instead of JavaScript in some code samples for increased clarity; the two languages behave the same.

Value vs. Reference

Consider the following TypeScript function.

function logEachAndEmpty(messages: string[]): void {
  for (const message of messages) {
    console.log(message);
  }
  messages.length = 0;
}

It logs each string in an array and empties the array.

Now consider this function, which seems to be a C++ translation.

void LogEachAndEmpty(std::vector<std::string> messages) {
  for (std::string message : messages) {
    std::cout << message << std::endl;
  }
  messages.clear();
}

Both functions log each string, but only the TypeScript function empties the input sequence. This is because function arguments are passed by value (that is, copy) in C++ by default. In addition to preventing us from mutating the input, this can negatively impact performance, especially for a potentially-large input like a vector of strings. To fix this, we can instead pass the argument by reference.

void LogEachAndEmpty(std::vector<std::string>& messages) {
  for (std::string& message : messages) {
    std::cout << message << std::endl;
  }
  messages.clear();
}

Note that we changed both the type of the argument and the type in the for loop. We were previously copying each string twice: once when copying the vector and once in the for loop.

A good rule of thumb is to take objects by reference and primitives (e.g. bool, char, int) by value.

Reassignment is Pointer Mutation

Here's an example of one of the biggest gotchas for beginner JavaScript programmers.

let a = 0;
let b = 1;
a = b;  // a is now 1
b += 1;  // b is now 2 (a is still 1)

let x = [0];
let y = [1];
x = y;  // x is now [1]
y[0] += 1;  // x and y are now [2]

If you understand the above example, then the behavior of C++ may be surprising.

int a = 0;
int b = 1;
a = b;  // a is now 1
b += 1;  // b is now 2 (a is still 1)

std::vector<int> x = {0};
std::vector<int> y = {1};
x = y;  // x is now {1}
y[0] += 1;  // y is now {2} (x is still {1})

This is because the = operator in C++ means invoke the copy constructor. If you want to mimic JavaScript's reassignment, you must use a pointer. A pointer, like a reference, is a variable which allows you to access another variable. Unlike a reference, a pointer can be changed.

std::vector<int> x = {0};
std::vector<int> y = {1};
std::vector<int>* p = &x;
std::vector<int>* q = &y;
p = q;  // *p is now {1} (x is still {0})
q[0] += 1;  // *p, *q, and y are now {2} (x is still {0})

Note that x still exists and is unchanged from its original value. This is different than the JavaScript example, where the original array x referred to is gone.

A Word About Syntax

In JavaScript, you declare variables with let and const. The former allows the variable to be reassigned while the latter does not. Either way, the variable can be a mutable object. In TypeScript, you can use a type to prevent mutation.

let yesAssignYesMutate: boolean[] = [];
let yesAssignNoMutate: readonly boolean[] = [];
const noAssignYesMutate: boolean[] = [];
const noAssignNoMutate: readonly boolean[] = [];

In C++, reassignment is mutation. To recreate the example in C++, we need pointers.

std::vector<bool>* yes_assign_yes_mutate;
const std::vector<bool>* yes_assign_no_mutate;
std::vector<bool>* const no_assign_yes_mutate;
const std::vector<bool>* const no_assign_no_mutate;

Note that const after * means that the pointer cannot be changed while const before the type means that what the pointer points to cannot be changed. const modifies whatever comes before it; if there is nothing before it, const modifies what comes after it. This means the immutable examples could have also been written like this, though I find this style to be less common.

std::vector<bool> const* yes_assign_no_mutate;
std::vector<bool> const* const no_assign_no_mutate;

The spacing before and after the * modifier (and & for references) also does not matter. I find the first two styles listed below to be about equally common.

int* space_l;
int *space_r;
int * space_b;

* and & modify what comes after them. This is relevant when declaring multiple variables in one line, though doing so is generally considered to be bad style.

int* pointer, *also_pointer, not_pointer;

Note that * and & are both type modifiers and operators.

int variable;
int& reference(variable);  // & means reference type.
int* pointer;  // * means pointer type.
pointer = &variable;  // & means get pointer from variable.
variable += *pointer;  // * means get variable from pointer.

Data Live Places

Here is TypeScript code which creates a binary tree.

interface GraphNode {
  children: GraphNode[];
}

function makeBinaryTree(depth: number): GraphNode {
  const node: GraphNode = {children: []};
  if (depth > 1) {
    for (let i = 0; i < 2; i++) {
      node.children.push(makeBinaryTree(depth - 1));
    }
  }
  return node;
}

Now consider what seems to be a C++ translation.

struct GraphNode {
  std::vector<GraphNode*> children;
};

GraphNode MakeBinaryTree(int depth) {
  GraphNode node;
  if (depth > 1) {
    for (int i = 0; i < 2; i++) {
      GraphNode child_node = MakeBinaryTree(depth - 1);
      node.children.push_back(&child_node);
    }
  }
  return node;
}

Unfortunately, this function probably will not work. This is because in C++, data live places. In JavaScript, if you can access an object, you can use it. In C++, this is not necessarily true. When you declare a variable like we do with child_node, it is allocated on the stack: essentially, right where you created it. Stack-allocated variables get destroyed when they go out of scope. This means that the pointer we added to the vector refers to freed memory. Reading it might work, but the language provides no such guarantee.

We already know one solution: make a copy. This is fine if we don't plan to mutate the GraphNode after returning it from MakeBinaryTree, but it is an unsuitable approach for a changing graph. It may also hurt performance for large graphs or if more data members get added to GraphNode.

If we need an object to outlive the scope where it's constructed, we need to allocate it on the heap. A heap-allocated object can be constructed with the new keyword.

GraphNode* MakeBinaryTree(int depth) {
  auto node = new GraphNode();
  if (depth > 1) {
    for (int i = 0; i < 2; i++) {
      node->children.push_back(MakeBinaryTree(depth - 1));
    }
  }
  return node;
}

Now each GraphNode is guaranteed to exist when we dereference its pointer. However, we have a new problem: when a pointer goes out of scope, the object it points to is not destructed. If all pointers to allocated memory are gone, this is known as a memory leak, and it can cause issues. To destroy an object and free its memory, use the delete keyword.

GraphNode* root = MakeBinaryTree(10);
DoSomethingWithBinaryTree(root);
delete root;

Unfortunately, we still have a memory leak. Only the root GraphNode was deleted, and now there are no more pointers to its children. We can fix this by giving GraphNode a destructor, which is called when delete is used but before memory is freed, to delete its children (which will recursively delete all of its descendants).

struct GraphNode {
  ~GraphNode() {
    for (GraphNode* child : children) {
      delete child;
    }
  }
  std::vector<GraphNode*> children;
};

Knowing that GraphNode deletes its descendants upon deletion, users of it (including you, probably) will need to take care when holding onto pointers.

GraphNode* root = MakeBinaryTree(10);
GraphNode* left_child = root->children[0];
delete root;
DoSomethingWithBinaryTree(left_child);  // This is not safe.

The mistake in the example above is fairly obvious, but as complexity grows, raw pointers become harder to reason about. To prevent such mistakes (and save a few lines of code), std::unique_ptr can be used. This type behaves like a pointer but calls delete on the object it points to when destructed.

struct TreeNode {
  std::vector<std::unique_ptr<TreeNode>> children;
};

std::unique_ptr<TreeNode> MakeBinaryTree(int depth) {
  auto node = std::make_unique<TreeNode>();
  if (depth > 1) {
    for (int i = 0; i < 2; i++) {
      node->children.push_back(MakeBinaryTree(depth - 1));
    }
  }
  return std::move(node);
}

Now when TreeNode (or std::unique_ptr<TreeNode>) is destroyed, its descendants will be as well. std::move is required at the end of the function because std::unique_ptr cannot be copied. It is not required on the line with push_back because the compiler is smart enough to store the return value of MakeBinaryTree in the vector from the beginning.

Notice that I've changed the name of the struct from GraphNode to TreeNode. This is because using std::unique_ptr means that each node can only have one parent, which is a constraint of trees but not of graphs in general. If we need nodes with multiple parents, we can use std::shared_ptr, which can be copied and uses reference counting to delete the object it points to when all copies of the std::shared_ptr are destroyed. Take caution when using std::shared_ptr, since memory leaks are still possible with cyclic references.

struct GraphNode {
  std::vector<std::shared_ptr<GraphNode>> children;
};

void LeakMemory() {
  auto node_a = std::make_shared<GraphNode>();
  auto node_b = std::make_shared<GraphNode>();
  node_a->children.push_back(node_b);
  node_b->children.push_back(node_a);
}

None of these problems exist in JavaScript because JavaScript uses garbage collection, which means it frees memory automatically after is no longer accessible. Garbage collection has a performance cost and doesn't necessarily free memory right away, which is part of what makes C++ faster than most other languages. However, the lack of garbage collection means C++ developers need to think harder about the lifetime of variables.

Via more complex reference counting, you could make unreachable GraphNode objects be destroyed automatically. However, I leave this as an exercise to the reader. I, on the other hand, will download a battle-tested, open-source graph library if I ever need one.

Multithreading is Possible

Consider the following TypeScript program.

let x = 0;
const increment = () => {
  x += 1;
};

const promises: Promise<void>[] = [];
for (let i = 0; i < 100; i += 1) {
  promises.push(Promise.resolve().then(increment));
}
await Promise.all(promises);
console.log(x);

It starts 100 asynchronous tasks each to increment x, waits for them all to finish, and logs the value of x. This program will always log 100.

Now let's try it in C++.

int x = 0;
std::function<void()> increment = [&x] {
  x += 1;
};

std::vector<std::thread> threads;
for (int i = 0; i < 100; i += 1) {
  threads.emplace_back(increment);
}
for (std::thread& thread : threads) {
  thread.join();
}
std::cout << x << std::endl;

On my machine, this program usually logs 100, but sometimes it will log something slightly off such as 99 or 98. Like in TypeScript, increment and decrement are being called 100 times each. However, in single-threaded TypeScript, only one function is ever being executed at a time; in multithreaded C++, this is not the case. If two threads modify a variable at the same time, the result is unspecified.

When writing multithreaded C++ code, you generally must ensure the following.

  • No writes during reads or other writes.
  • No reads during writes.

Generally, the way we ensure this is with a mutex. A mutex can be locked and unlocked, and it can only be locked once. If the mutex is already locked, a thread needing to lock the mutex must wait for it to be unlocked. By ensuring each function only modifies x after locking a mutex, we can fix our bug.

std::mutex mutex;
int x = 0;  // Guarded by mutex.
std::function<void()> increment = [&x, &mutex] {
  std::lock_guard<std::mutex> guard(mutex);
  x += 1;
};

We do not need to worry about concurrent modification of the mutex because the lock and unlock operations are atomic. We could have instead changed the type of x to std::atomic_int, but for non-trivial types, a mutex is the way to ensure thread safety.