Story heading

How I Built the Fastest JavaScript Data Differ

January 1, 2022 (Updated January 2, 2022)

Microdiff, a diffing algorithm for structured data in JavaScript that I maintain, is pretty fast. So fast, in fact, that you might be wondering whether there is black magic behind the scenes, or perhaps randomness and a lot of luck when benchmarking. An inquisitve and probably bored developer asked me to explain why Microdiff is this fast:

I believe that an explainer (perhaps a blog post?) of the inefficiencies you’ve noticed with other libraries and how you’ve overcome them (including the “edge cases” you’re saying are unsupported) would be very interesting. They’d also help correctly interpret the benchmark results.

So, I decided to write a blog post about it, describing how I made Microdiff faster than most other object and array diffing libraries.

What is diffing?

Diffing (difference tracking) algorithms track changes made between two pieces of data. For example, if you have two objects, a b:

const a = {
	bananas: true,
	apples: true,
	peaches: true,
};
const b = {
	bananas: true,
	apples: false,
	lemons: true,
};

With Microdiff you could retrieve the differences by passing both to the diffing function:

import diff from "microdiff";
console.log(JSON.stringify(microdiff(a, b)));

/*
[
    {
        'type': 'CHANGE',
        'path': ['apples'],
        'value': false,
        'oldValue': true,
    {
        'type': 'REMOVE',
        'path': ['peaches'],
        'oldValue': true
    },
    {
        'type': 'CREATE',
        'path': ['lemons'],
        'value': true
    }
]
*/

All edits, removals, and additions are recorded. These records can be used to reconstruct the new data from the original data, meaning data updates can be transferred using diffing records. In cases where data transfer is expensive and the diffs are relatively small in comparison to the overall piece of data, diffing is often used to maximize performance. This includes Virtual DOMs, document editing functionalities, NoSQL databases, and quite a few other pieces of software.

How fast is Microdiff?

To measure Microdiff’s performance, I took a sampling of benchmarks matching real-world use cases from PandaCSS’s configuration diffing, Scalar’s OpenAPI document diffing, and vanilla-extract’s theme diffing. All of these benchmarks match the document formats used in these open-source projects, and are chosen to vary in size and data types included. I measured the time/operation for each of these samples under Microdiff and the 4 most popular JavaScript diffing libraries.

All benchmarks were run multiple times under mitata on Node v22.12.0.

Microdiff vastly outperforms all other diffing algorithms in these benchmarks, especially with cycle protection disabled (which is a feature not included in many diffing algorithms anyway), being 97% faster than the next closest algorithm, deep-diff

How is Microdiff so fast?

Unified rich type handling

One central aspect of JavaScript’s design, tracing back to JavaScript’s attempt to copy Java and its OOPisms, is the extensive use of unique objects, like Date and RegExp. Unfortuantely, these data types make diffing a lot more complicated. If you attempt to diff Dates with a naive diffing algorithm designed for primitives and basic objects only, the algorithm will fail. Either it will attempt to compare the dates using a strict equality comparison === and mark two equal dates as unequal, or it will interpret the dates as basic objects and attempt to recursively diff them. Both of these outcomes will give you the incorrect result.

const date = new Date();
const equivalentDate = new Date(date);
console.log(date === equivalentDate);
// false

To handle these more complex types, prior JavaScript diffing algorithms resorted to creating a list of potential data types, each with its own classification and comparison method. For example, deep-diff’s code for datatype classifcation checks for RegEx and Dates individually (and, oddly enough, the math global, which I will talk about later):

var type = typeof subject;
if (type !== "object") {
	return type;
}

if (subject === Math) {
	return "math";
} else if (subject === null) {
	return "null";
} else if (Array.isArray(subject)) {
	return "array";
} else if (Object.prototype.toString.call(subject) === "[object Date]") {
	return "date";
} else if (
	typeof subject.toString === "function" &&
	/^\/.*\//.test(subject.toString())
) {
	return "regexp";
}
return "object";

Each datatype handled here requires a separate check, which quickly increases overhead for every piece of data that is not a primitive.

Microdiff standardizes the handling of these data types. Instead of using a separate comparison method for every data type, Microdiff handles the vast majority of cases with three:

  1. A simple strict equality operator
  2. Implicit stringification or numeric conversion using .toString() or +
  3. Deep diffing, creating a new nested Microdiff instance
// Comparison method 1
obj1 === obj2;

// Comparison method 2
obj1.toString() === obj2.toString();
// More accurately, in Microdiff method 2 is implemented like:
isNaN(obj1) ? obj1 + "" === obj2 + "" : +obj1 === +obj2;

// Comparison method 3
microdiff(obj1, obj2); // recursively call Microdiff, add diffs

Each of these cases can handle primitives, objects with custom prototypes that include stringification methods, and basic objects with default prototypes, respectively. Now, instead of needing to classify exactly what datatype a certain value is, Microdiff only needs to figure out which of those three classes the object fits in to.

Microdiff separates objects and primitives using typeof, similar to existing diffing algorithms. However, to handle all rich types, I added a lookup table that compared the data’s prototype tree with a lookup table of known rich types:

const richTypes = { Date: true, RegExp: true, String: true, Number: true };
const isRichType = richTypes[Object.getPrototypeOf(value)?.constructor?.name];

Now Microdiff will coerce any object conforms to one of the classes in richTypes to a primitive rather than attempting to execute a nested diff, while remaining within a time complexity that will not scale with the number of richTypes.

Avoiding unnecessary checks and exceptions

The example of deep-diffs type classification mechanism used earlier in the article also highlights another unoptimized design feature of most other diffing alogorithms—they handle edge cases that should never occure. In that example, the first conditional checks for whether an object is Math:

if (subject === Math) {
	return "math";
}

To be clear, this piece of code is comparing the data to Math, the global. You know, the object you use to get the max of values:

console.log(Math.max(1, 2, 3));
// 3

I have no clue why anyone would be passing this into data. Including Math is almost like including the standard library in another language, which is pretty much impossible in any language other than JavaScript. All of this is to say, Math contains utility functions, NOT data, and should never be in an object you are diffing.

Most diffing algorithms spend valuable CPU cycles and bytes of code checking for values like Math. Microdiff does not. As time has gone on, Microdiff has gradually included support for more data types, to the point where it covers almost anything that should be included in JavaScript data. However, Microdiff avoids covering these extra cases that should never exist in practice.

Simplifying the execution model

Most other diffing algorithms are built as classes, with a variety of methods and helper functions performing specific subtasks in the overall diffing process. Microdiff instead uses a single function that recurses for nested objects, and intentionally does not follow DRY principles. Because JavaScript in a JIT compiled language, the semantics used to maximize code sharing generate significant overhead, especially at the beginning of execution. Even the calls to utility functions can quickly become expensive when they are required for every property being diffed. Most optimizing JavaScript engines will eventually unroll some of these functions, lessening the performance hit, but the unrolling process is requires multiple invocations and is unreliable.

Microdiff’s architecture maximizes performance without these optimizations, and, as a side effect, in my opinion, the code is actually more readable. Instead of being a complex tree of scattered utility functions, Microdiff is a single function that fits within 1kB of code.

Final Thoughts

I am slightly amazed you are interested enough in diffing algorithms to reach the end of this. If you think you have any further improvements, come over to the Microdiff repo and open a pull request, but more importantly, I hope you learned some helpful optimization tips.

Share

Sign up for email updates

Get interesting posts about web development and more programming straight in your inbox!