How Does SQLite Work?

2022-01-21

This is going to be fun.

Recently I wrote about passwords and ssh, but I'm going to write about SQLite a little differently, because I know it reasonably well and have worked with it for any number of side projects.

So rather than a 101 guide or a tutorial, this is going to be more of an SQLite spelunk, a technical mukbang if you will. For the next hour or so, I'll shout out every question I've had in the back of my mind, and we'll see how many I can answer before I go to bed.

The basics

What is SQLite? A free, fast, and public-domain RDBMS that lives on your filesystem as opposed to some separate process.

What does RDBMS mean? It stands for relational database management system. An RDBMS structures its data kind of like an Excel spreadsheet: you have tables with rows, and columns, and these columns have different relationships to each other. For example, you might have a table students with columns id and name, and you might have a table classes with columns id and title. Then you might model which students are in which classes with a table like student_enrollment with columns student_id and class_id.

What does SQL mean? It stands for structured query language. It's the language we use to tell an RDBMS what data we need or which changes we want to make. There are some small variants of SQL (e.g. HiveQL for querying Hadoop data), but they all have a similar syntax, e.g.:

1	`select id, name from students where name = 'Arun Prasad';`

How many RDBMSes are there? I know of only three: SQLite, PostgreSQL, and MySQL.

Who uses RDMBSes? They're the database of choice for most of the major web services you use day to day. According to one article the users of PostgreSQL include Apple, Reddit, Spotify, Twitch, Instagram, and Skype. And according to the MySQL website the users of MySQL include Airbnb, YouTube, Netflix, Tesla, Square, Box, and more.

So why use SQLite? It's easy to set up and has great performance for embedded use cases and for low- to moderate-traffic websites (single server, less than ~1M hits per day). See this analysis from the SQLite team for more analysis and explanation.

System architecture

What does SQLite actually look like? — An RDBMS has two main tasks. First, it needs to store its data in a way that it won't get corrupted. Second, it needs to make that data readable and writable through SQL. Here's the SQLite system architecture, via sqlite.org:

SQLite architecture diagram

Let's start with the Core. It receives a query through its C-language interface and passes it to the SQL Command Processor, which wraps a pretty standard language compiler:

The input SQL is tokenized into strings, keywords, numbers, etc.
That token data is assembled into a syntactically valid parse tree.
The parse tree is then converted to generated code, in this case into bytecode.

That bytecode is then executed on the Virtual Machine and run against the SQLite backend. Data is stored on disk in a B-Tree data structure, which requests data from a Pager cache that in turns queries an OS Interface that wraps OS-specific functionality.

Core

Why use bytecode and a virtual machine? SQLite is written in C but it needs to understand queries written in SQL. So one way or another, that SQL must be able to run as C code. One way to do this is to break down an SQL query into the specific instructions (i.e. bytecode) that need to be run for the query to work then run those instructions through an emulator (i.e. virtual machine) written in C.

What does bytecode look like? — Here's an example from the docs:

sqlite> explain delete from tbl1 where two<20;
addr  opcode         p1    p2    p3    p4             p5  comment      
——  ——————-  ——  ——  ——  ——————-  —  ——————-
0     Init           0     12    0                    00  Start at 12  
1     Null           0     1     0                    00  r[1]=NULL    
2     OpenWrite      0     2     0     3              00  root=2 iDb=0; tbl1
3     Rewind         0     10    0                    00               
4       Column         0     1     2                    00  r[2]=tbl1.two
5       Ge             3     9     2     (BINARY)       51  if r[2]>=r[3] goto 9
6       Rowid          0     4     0                    00  r[4]=rowid   
7       Once           0     8     0                    00               
8       Delete         0     1     0     tbl1           02               
9     Next           0     4     0                    01               
10    Noop           0     0     0                    00               
11    Halt           0     0     0                    00               
12    Transaction    0     1     1     0              01  usesStmtJournal=0
13    TableLock      0     2     1     tbl1           00  iDb=0 root=2 write=1
14    Integer        20    3     0                    00  r[3]=20      
15    Goto           0     1     0                    00

I don't understand these instructions at all. — The opcode is the name of the instruction, and p1 to p5 are its operands. These operands are often pointers to areas within tables or b-trees. Here's the full reference for SQLite opcodes.

Roughly, the instructions above read through the table entry by entry and delete the row wherever the condition is matched. I looked some of these opcodes up, and here's what I found:

2 OpenWrite: creates a read/write pointer to the table whose root page is pointed to by p2.
3 Rewind jumps to instruction *p2 if the table pointed to by p1 is empty. Otherwise it continues to the next instruction, which starts at the beginning of the table.
5 Ge compares the values in register p1 and p3. If *p1 >= *p3, jump to instruction p2.
8 Delete deletes the record that p1 points to.
9 Next increments the p1 pointer (thus pointing to the next row) and jumps to *p2 if a next row was found. Next defines the loop that performs the database scan.
13 TableLock: obtains a lock in the database pointed to by p1 and the table pointed to by p2. p3 specifies whether this is a readlock (0) or a writelock (1). p4 points to the table's name in case an error message needs to be generated.

But how is SQL converted to these opcodes? — That's in the code generator, where most of the serious magic happens. For example, here's the code generator for delete statements.

Backend (b-trees)

What's a b-tree? — It's a self-balancing version of the classic binary search tree, except that nodes can have up to m children, where m is some integer we choose. They're very handy for databases and filesystems. Bayer and McCreight invented the b-tree at Boeing around 1970.

Who uses them? — Quoth Wikipedia,

The B-tree remains the standard index implementation in almost all relational databases, and many nonrelational databases use them too.

What does the "b" stand for? — The answer will shock and dazzle you:

Bayer and McCreight never explained what, if anything, the B stands for: Boeing, balanced, broad, bushy, and Bayer have been suggested. McCreight has said that "the more you think about what the B in B-trees means, the better you understand B-trees."

What a troll answer. — Right?

Why not just use a binary search tree? — Think of a balanced binary search tree with around 1000 nodes. Searching for an arbitrary node in this tree will take around 10 steps (log_2(1000) ~= 10). This is usually fine if the tree is in memory, but if we need to search for data on disk, each step is enormously expensive. Since b-tree nodes support more than two children, they make it substantially easier to search the data with fewer disk reads required.

How does b-tree insertion work? — We start at the root and search for the leaf node where the item should be inserted. If the node isn't at capacity, we insert and we're done. If it is, split the node at the median. (This may recursively cause the parent node to be split, and likewise all the way to the root.)

How does b-tree deletion work? — We start at the root and search for the leaf node that contains the item. If deleting the item reduces the node below some minimum capacity, we borrow nodes from siblings if possible (this is called rotation) or merge with siblings if not (this is just called merging). Then we recursively apply this procedure, as before. Full details are on Wikipedia.

How do b-trees relate to indexing? — SQLite uses table b-trees to store table data and index b-trees to store indices into those tables. I'm not sure how index b-trees work, but I think they map their index key (e.g. a student name) to specific pages on disk.

Backend (file format)

How is data stored on disk? — As one big file.

Really? — The database is split into multiple pages, all of which have 2^n bytes where 9 <= n <= 16. All of a database's pages have the same size. The first page contains some basic metadata.

What does the metadata do? — It defines the file format, page size, version information, and other database settings. It also points to a table b-tree that defines the database schema.

How are rows actually laid out on disk? — Each entry in the table b-tree corresponds to a row of the SQL table. The row is just a byte array with the same column order as in the table definition.

Is adding a new column expensive? — It's cheap because the column is just appended to the end of the row, and if it has no data, it doesn't require any change to the underlying payloads:

No changes are made to table content for renames or column addition without constraints. Because of this, the execution time of such ALTER TABLE commands is independent of the amount of data in the table and such commands will run as quickly on a table with 10 million rows as on a table with 1 row

Miscellaneous questions

Why does SQLite use dynamic typing? — In the team's view, this is a feature, not a bug, and the rationale is the same as for any dynamically typed system (Python, JSON, etc): convenience, especially when dealing with complex structured data. However, there is a STRICT mode that supports some basic types.

What's up with the SQLite license? — It's public domain, which is rare.

And the Code of Ethics? — It's eccentric, but more power to them for boldly stating the beliefs they live by.

Endnotes

Thanks to the SQLite team for creating such an awesome library. This post took about 100 minutes.