How Does SQLite Work?
2022-01-21
This is going to be fun.
Recently I wrote about passwords and ssh, but I'm going to write about SQLite a little differently, because I know it reasonably well and have worked with it for any number of side projects.
So rather than a 101 guide or a tutorial, this is going to be more of an SQLite spelunk, a technical mukbang if you will. For the next hour or so, I'll shout out every question I've had in the back of my mind, and we'll see how many I can answer before I go to bed.
The basics
What is SQLite? A free, fast, and public-domain RDBMS that lives on your filesystem as opposed to some separate process.
What does RDBMS mean? It stands for relational database management
system. An RDBMS structures its data kind of like an Excel spreadsheet: you
have tables with rows, and columns, and these columns have different
relationships to each other. For example, you might have a table students
with columns id
and name
, and you might have a table classes
with columns
id
and title
. Then you might model which students are in which classes with
a table like student_enrollment
with columns student_id
and class_id
.
What does SQL mean? It stands for structured query language. It's the language we use to tell an RDBMS what data we need or which changes we want to make. There are some small variants of SQL (e.g. HiveQL for querying Hadoop data), but they all have a similar syntax, e.g.:
1 |
|
How many RDBMSes are there? I know of only three: SQLite, PostgreSQL, and MySQL.
Who uses RDMBSes? They're the database of choice for most of the major web services you use day to day. According to one article the users of PostgreSQL include Apple, Reddit, Spotify, Twitch, Instagram, and Skype. And according to the MySQL website the users of MySQL include Airbnb, YouTube, Netflix, Tesla, Square, Box, and more.
So why use SQLite? It's easy to set up and has great performance for embedded use cases and for low- to moderate-traffic websites (single server, less than ~1M hits per day). See this analysis from the SQLite team for more analysis and explanation.
System architecture
What does SQLite actually look like? — An RDBMS has two main tasks. First, it needs to store its data in a way that it won't get corrupted. Second, it needs to make that data readable and writable through SQL. Here's the SQLite system architecture, via sqlite.org:
Let's start with the Core. It receives a query through its C-language interface and passes it to the SQL Command Processor, which wraps a pretty standard language compiler:
- The input SQL is tokenized into strings, keywords, numbers, etc.
- That token data is assembled into a syntactically valid parse tree.
- The parse tree is then converted to generated code, in this case into bytecode.
That bytecode is then executed on the Virtual Machine and run against the SQLite backend. Data is stored on disk in a B-Tree data structure, which requests data from a Pager cache that in turns queries an OS Interface that wraps OS-specific functionality.
Core
Why use bytecode and a virtual machine? SQLite is written in C but it needs to understand queries written in SQL. So one way or another, that SQL must be able to run as C code. One way to do this is to break down an SQL query into the specific instructions (i.e. bytecode) that need to be run for the query to work then run those instructions through an emulator (i.e. virtual machine) written in C.
What does bytecode look like? — Here's an example from the docs:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
I don't understand these instructions at all. — The opcode
is the name
of the instruction, and p1
to p5
are its operands. These operands are often
pointers to areas within tables or b-trees. Here's the full
reference for SQLite opcodes.
Roughly, the instructions above read through the table entry by entry and delete the row wherever the condition is matched. I looked some of these opcodes up, and here's what I found:
-
2 OpenWrite
: creates a read/write pointer to the table whose root page is pointed to byp2
. -
3 Rewind
jumps to instruction*p2
if the table pointed to byp1
is empty. Otherwise it continues to the next instruction, which starts at the beginning of the table. -
5 Ge
compares the values in registerp1
andp3
. If*p1 >= *p3
, jump to instructionp2
. -
8 Delete
deletes the record thatp1
points to. -
9 Next
increments thep1
pointer (thus pointing to the next row) and jumps to*p2
if a next row was found.Next
defines the loop that performs the database scan. -
13 TableLock
: obtains a lock in the database pointed to byp1
and the table pointed to byp2
.p3
specifies whether this is a readlock (0
) or a writelock (1
).p4
points to the table's name in case an error message needs to be generated.
But how is SQL converted to these opcodes? — That's in the code generator,
where most of the serious magic happens.
For example,
here's the code generator for delete
statements.
Backend (b-trees)
What's a b-tree? — It's a self-balancing version of the classic binary search tree, except that nodes can have up to m children, where m is some integer we choose. They're very handy for databases and filesystems. Bayer and McCreight invented the b-tree at Boeing around 1970.
Who uses them? — Quoth Wikipedia,
The B-tree remains the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
What does the "b" stand for? — The answer will shock and dazzle you:
Bayer and McCreight never explained what, if anything, the B stands for: Boeing, balanced, broad, bushy, and Bayer have been suggested. McCreight has said that "the more you think about what the B in B-trees means, the better you understand B-trees."
What a troll answer. — Right?
Why not just use a binary search tree? — Think of a balanced binary search
tree with around 1000 nodes. Searching for an arbitrary node in this tree will
take around 10 steps (log_2(1000) ~= 10
). This is usually fine if the tree is
in memory, but if we need to search for data on disk, each step is enormously
expensive. Since b-tree nodes support more than two children, they make it
substantially easier to search the data with fewer disk reads required.
How does b-tree insertion work? — We start at the root and search for the leaf node where the item should be inserted. If the node isn't at capacity, we insert and we're done. If it is, split the node at the median. (This may recursively cause the parent node to be split, and likewise all the way to the root.)
How does b-tree deletion work? — We start at the root and search for the leaf node that contains the item. If deleting the item reduces the node below some minimum capacity, we borrow nodes from siblings if possible (this is called rotation) or merge with siblings if not (this is just called merging). Then we recursively apply this procedure, as before. Full details are on Wikipedia.
How do b-trees relate to indexing? — SQLite uses table b-trees to store
table data and index b-trees to store indices into those tables. I'm not sure
how index b-trees work, but I think they map their index key (e.g. a student
name
) to specific pages on disk.
Backend (file format)
How is data stored on disk? — As one big file.
Really? — The database is split into multiple pages, all of
which have 2^n
bytes where 9 <= n <= 16
. All of a database's pages have the
same size. The first page contains some basic metadata.
What does the metadata do? — It defines the file format, page size, version information, and other database settings. It also points to a table b-tree that defines the database schema.
How are rows actually laid out on disk? — Each entry in the table
b-tree corresponds to a row of the SQL table.
The row is just a byte array
with the same column order as in the table definition.
Is adding a new column expensive? — It's cheap because the column is just appended to the end of the row, and if it has no data, it doesn't require any change to the underlying payloads:
No changes are made to table content for renames or column addition without constraints. Because of this, the execution time of such ALTER TABLE commands is independent of the amount of data in the table and such commands will run as quickly on a table with 10 million rows as on a table with 1 row
Miscellaneous questions
Why does SQLite use dynamic typing? — In the team's view, this is a feature, not a bug, and the rationale is the same as for any dynamically typed system (Python, JSON, etc): convenience, especially when dealing with complex structured data. However, there is a STRICT mode that supports some basic types.
What's up with the SQLite license? — It's public domain, which is rare.
And the Code of Ethics? — It's eccentric, but more power to them for boldly stating the beliefs they live by.
Endnotes
Thanks to the SQLite team for creating such an awesome library. This post took about 100 minutes.