Basic Transaction-Like Behavior in Mongo (or SQL) / by Chris Shaffer

Okay, so if you really get transactions in SQL, you’ll realize quickly that we’re not really talking about transactions in the same sense. That’s kind of the point. Since an update or insert in Mongo is already atomic, what people usually really mean when they ask for “transactions” in Mongo is “protection against race conditions”:

What happens when two people simultaneously load the same record, make some changes, and then click ‘save’? One of them is going to have their changes blown away and not even know it. How do I prevent that?

An answer I’ve heard all-too-often is “well, we went with Mongo instead of SQL, and it doesn’t have transactions, so we’re just out of luck.” There are two problems with that:

  1. It’s not true, and
  2. It’s not true.

I’ll be more specific:

  1. Yes, you actually can do that in Mongo, and it’s not that hard.
  2. No, SQL transactions don’t solve that use case. They ensure that you don’t have to deal with data corruption — that an update across multiple tables doesn’t take effect on some, but not others — but SQL transactions don’t magically encapsulate the web browser, too.

Here’s a solution that will work pretty much identically in either Mongo or SQL:

var oldTid = record.trans_id || { $exists: false };
record.trans_id = (record.trans_id || 0) + 1;

return collection.updateAsync(
  { _id: record._id, trans_id: oldTid },
  { $set: record }
).then(result => {
  if (result.result.nModified) {
    //you're done
  else {
    //you got up-streamed, and your "transaction" had to "roll back"

That’s it! The roll-back code can be as complicated as Git, if you want to diff documents to find out whether there were any actual conflicts, or as simple as displaying an “oops! document has been edited by someone else!” message.

We can use a similar strategy for a change that affects multiple collections. Let’s say we have a companies collection and an employees collection, such that each employee works for a single company (for now), and we have to update a company and its employees in one go.

  • Insert into a companyLocks collection, with a unique key on companyId
  • If that fails, abort
  • Make your updates to companies and employees
  • Delete your record from companyLocks

The trick is that every command, anywhere, that updates either companies or employees needs to check that companyLocks collection before running.

This is fine when an employee can only exist at one company — employees are logically sub-documents of the company document, and the only reason they’re in separate collections at all is … a performance optimization … or maybe just because someone forgot they weren’t in SQL.

It gets more complicated if you have an employee working at multiple companies. Sure, you can plop down some $in’s and lock every employee that a company employs, but that means you’re effectively locking not only the company you’re updating, but every other company that shares an employee with it. That can lead to deadlocks, or just regular gridlock.

Generally, I’d ask why you needed to update companies and employees at the same time. Is it because you just have one edit page with one save button for company name, office address, and all of their employees’ names and salaries? In that case, it seems like a problem easily solved by UI — multiple save buttons, or no buttons and just saving automatically on changes.

Or is it because you’re computing an average salary and number of employees for each company? In that case, we’re really talking about a cache. You can work wonders by flagging the company’s cached values as “expired,” queuing it up for some background job that updates the cache a few seconds later.

Again, this cache job can be deceptively simple — just a setInterval — or if you want to spend an afternoon...

  • An array of jobs:
      name: string,
      run: function() { … },
      interval: integer,
      enabled: boolean,
      lastRun: date
  • A single setInterval “daemon” that runs once per second, checks which jobs need to be run, and calls .run() on them
  • A page with checkboxes to enable/disable the jobs
  • Initialize this job status array from and save it to a database store
  • Set up a separate application or machine to run it

Oh, and by the way, all of those strategies work pretty much the same way in SQL.