Writing Performant Spreadsheets
Computer scientists refer to the amount of time a program takes to execute on different sizes of input as its performance.
Spreadsheets are computer programs. Consequently, they are subject to the same computational limitations as other computer languages. Programmers in different computer languages, like Java, Python, and JavaScript, need to understand how fast their programs will run against large amounts of data. The same considerations hold for spreadsheet programmers.
A spreadsheet's performance depends on the amount of time each worksheet takes to complete. In turn, each worksheet's performance depends on the amount of time each cell takes to complete.
We have designed this section to help Pebblers create spreadsheets that can run against arbitrarily large input and still complete in the fastest possible time.
To understand whether or not your spreadsheet will execute quickly, Pebblers must first understand the different categories of cell-level operations in spreadsheet computing. Each of these different categories of operations has different performance profiles. Some categories perform better than others.
Pebblers should reference the following list of the different categories of cell level operations in order of fastest to slowest.
Relative Speed varies depending on the speed of your CPU and the amount of data being processed
The provided relative times are for worksheets that contain a thousands of rows of data. An individual Pebblers' experience may vary. Don't focus too much on the timings. Consider the relative performance of each category of cell or range operation to better understand performance impact.
Cell Level Operation | Example | Relative Speed | Scalable? | Big O Notation |
---|---|---|---|---|
Arithmetic Calculation in a single cell | 43 + 17 * 3 | Less than a millisecond | Yes | Constant time O(1) |
String Operation in a single cell | MID, TEXT, LEFT, RIGHT, LEN | Less than a millisecond | Yes | Linear time O(n) where n is the length of the string. String operations on long strings may take longer than on short strings. |
Cell Reference | A1, BA$54, $M$1024 | milliseconds | Yes | Constant time O(1) |
Single approximate match lookup on a sorted range | MATCH(D6,B6:B14,1) MATCH(D6,B6:B14,-1) | tens of milliseconds | Yes | Logarithmic time O(logn) where n is the number of cells in the searched range. The range must be sorted! |
Single summation on an unsorted range | SUM(A:A) | tens of milliseconds | Yes | Linear time O(n) where n is the number of rows in the summed range. |
Single exact match lookup on a range | MATCH(D6,B6:B14,0) | tens of milliseconds | Yes | Linear time O(n) where n is the number of cells in the searched range. |
grouping | sum, mult, balance | tens of milliseconds | Yes | Linear time O(n), where n is the number of rows in the worksheet to be grouped |
joining | join, outer-join | tens of milliseconds | Yes | Linear time O(n), where n is the number of rows in the worksheets to be joined. |
sorting | sort | hundreds of milliseconds | not really | Log-linear time O(nlogn), where n is the number of rows in the worksheet to be joined. Only apply sorting when grouping won't do! |
Approximate match applied to every row in a worksheet | Drag down using the match-rows of MATCH(D6,B6:B14,1) | hundreds of milliseconds | No | Linear time O(m * logn), where n is the number of cells in the search range, and m is the number of dragged-down cells the search is executed in. Consider joining instead! |
Exact match applied to every row in a worksheet | Drag down using the match-rows directive of MATCH(D6,B6:B14,0) | hundreds of milliseconds | No | Quadratic time O(m * n) where n is the number of cells in the searched range, and m is the number of dragged-down cells the search is executed in. Consider joining instead! |
Cartesian product | cartesian-product | hundreds of milliseconds | No | Quadratic time O(m * n) where m is the number of rows in the first sheet and n is the number of rows in the second sheet. Consider joining instead! |
Lookups (vlookup, xlookup, match) performed on multiple columns on every row in the worksheet. | MATCH(D6,B6:B14,0) | tens of seconds | No | Quadratic time O(m n c) where n is the number of cells in the search range, m is the number of dragged-down cells, and c is the number of columns the search is executed in. Consider joining instead! |
*IFS performed on multiple columns | SUMIF, SUMIFS, COUNTIFS | tens of seconds | No | Quadratic time O(m n c) where n is the number of cells in the search range, m is the number of dragged-down cells, and c is the number of columns the search is executed in. Consider joining, or group summing instead! |
Big O Notation is how computer scientists classify an algorithm's speed.
Pebblers do not have to understand Big (O) notation to write fast-running worksheets; however, if you are interested in how computer scientists think about performance, Wikipedia does a fantastic job of explaining how Big (O) analysis predicts a spreadsheets scalabilty.
Pebblers should use this table when designing worksheets. For instance, it is better to use the join directives to match rows of data in separate worksheets than to perform lookups. The improved performance is why join directives are preferred to lookups when designing spreadsheets with Pebble Stream.
In general, if you are looking to process millions of rows quickly, avoid unscalable operations, particularly quadratic and n-squared time operations.
Updated about 1 month ago