Examples
Examples of common MySQL queries
MD5 column
When working with large columns, performing indexed searches can be a challenge. One solution is to use an MD5 column to index the search. In this video, we will explore how an MD5 column can be used for an indexed search and provide step-by-step instructions on how to add an MD5 column to a MySQL table.
The basics
To understand how an MD5 column works, let's start with a basic example. Let's consider a simple table called urls that contains only two columns: id and url. The url column is a TEXT column because it can be a very large value consisting of many thousands of characters.
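A minimal sketch of that table might look like this (the exact id definition is an assumption):

```sql
CREATE TABLE urls (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  url TEXT NOT NULL
);
```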
If we want to perform a strict equality lookup on this table where the URL equals a specific value, we could run the following SQL query:
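Something along these lines, with a placeholder URL:

```sql
SELECT * FROM urls WHERE url = 'https://example.com/some/very/long/path';
```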
Since the url column is not indexed, this query would result in a full table scan, searching through all the huge text columns. Unfortunately we can't just add a regular index to a TEXT column.
MD5 hashes
One way to create an equality index for a TEXT column is to add an MD5 hash virtual generated column. An MD5 hash is a 128-bit value that can be calculated for any file, text string, or other data input.
To add an MD5 hash virtual generated column to the URLs table, we can run the following SQL query:
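A sketch of that statement, storing the hash as a 32-character hex string for now:

```sql
ALTER TABLE urls
  ADD COLUMN url_md5 CHAR(32)
    GENERATED ALWAYS AS (MD5(url)) VIRTUAL;
```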
This creates a new column called url_md5 that contains the MD5 hash of the URL column. This column can be indexed using a B-tree index.
To add an index to the new url_md5 column, we can use the following SQL query:
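For example (the index name is arbitrary):

```sql
ALTER TABLE urls ADD INDEX url_md5 (url_md5);
```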
Once indexed, we can perform an indexed search by running the following SQL query:
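The lookup hashes the search value and compares it against the indexed column (placeholder URL again):

```sql
SELECT * FROM urls WHERE url_md5 = MD5('https://example.com/some/very/long/path');
```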
This query takes advantage of the index that was created on the url_md5 column, resulting in a much faster and more efficient search.
Binary strings
When creating an MD5 hash virtual generated column, it is important to specify the data type. Since we're only ever going to be comparing sets of bytes rather than characters, we can use BINARY(16) for a more efficient search.
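One way to do that is to replace the CHAR(32) column from the earlier sketch with a BINARY(16) one, unhexing the hash so it is stored as raw bytes:

```sql
ALTER TABLE urls
  DROP COLUMN url_md5,
  ADD COLUMN url_md5 BINARY(16)
    GENERATED ALWAYS AS (UNHEX(MD5(url))) VIRTUAL;
```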
Then we can add an index to the new binary url_md5 column using the following SQL query:
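Same as before, just against the new binary column:

```sql
ALTER TABLE urls ADD INDEX url_md5 (url_md5);
```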
And perform an indexed search using the following SQL query:
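This time the search value is hashed and then unhexed so it matches the raw bytes in the column (placeholder URL):

```sql
SELECT * FROM urls WHERE url_md5 = UNHEX(MD5('https://example.com/some/very/long/path'));
```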
Searching against the binary column is more efficient and faster than the previous example using CHAR.
MD5 over multiple columns
Consider the example of a table that stores normalized addresses. The table consists of four columns – primary_line, secondary_line, urbanization, and last_line. The goal is to store unique addresses and only keep the pretty, normalized version of an address. However, since there is no natural ID or other unique identifier, we need to create our unique ID.
Creating the MD5 hash
To generate the MD5 hash, we create a new column, simply called md5. As in the earlier example, we will use BINARY(16) for the MD5 hash column. We can do this by running the following command:
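A sketch, assuming the table is named addresses:

```sql
ALTER TABLE addresses
  ADD COLUMN md5 BINARY(16)
    GENERATED ALWAYS AS (
      UNHEX(MD5(CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line)))
    ) VIRTUAL;
```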
Note that we are hashing the columns concatenated with a separator (|). We use the CONCAT_WS function, whose first argument is the separator, to combine all four columns, and then generate the MD5 hash of the result.
Making the MD5 hash unique
Since we want to ensure that we store only unique values, we can add a unique index on the MD5 column. This is done by running the following command:
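For example:

```sql
ALTER TABLE addresses ADD UNIQUE INDEX md5 (md5);
```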
This ensures that we can perform strict equality checks across multiple columns since we have one logical unit with one logical ID created through this new column.
Binary vs. character columns
Note that because the md5 column is a binary column (a string of bytes, not characters), we must use the UNHEX function to convert the hexadecimal string produced by MD5() into a binary string.
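For example, a lookup for a single normalized address might look like this (the address values are placeholders):

```sql
SELECT *
FROM addresses
WHERE md5 = UNHEX(MD5(CONCAT_WS('|', '123 Main St', 'Apt 4', '', 'Anytown NY 12345')));
```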
By using this method to create unique MD5 hashes from multiple columns, we can now enforce uniqueness and perform lookups faster than a composite index across multiple character columns. This also saves us from having to maintain a more complex index structure.
Creating an MD5 hash from multiple columns is a powerful tool that can be used in various applications to achieve fast and efficient search results.
Bitwise operations
We explored an alternative solution to using a JSON column to store multiple flags in a single field. Using a tiny integer column to store multiple bits of information in a single field is an efficient way to save space in the database. However, it comes with some tradeoffs, including reduced readability and additional application logic overhead. Whether or not to use this approach depends on your specific use case and the tradeoffs you are willing to make!
Storing flags in a TINYINT column — a tradeoff between space and readability
When designing a database, it is often necessary to store "flags," or true/false values. One way to do this is to create a separate column for each value, but as the number of flags increases, this approach becomes impractical. One solution is to use a JSON column to store all the flags in a single field. However, this approach comes at the cost of increased space usage. In this video, we explore an alternative solution that uses a tiny integer column to store multiple bits of information in a single field.
Using a tiny integer column
Suppose we have a users table and we want to add a column to store true/false flags. Instead of using a JSON column, we can use a tiny integer column to store multiple bits of information in a single field. Let's see how this works.
First, we need to add a column to the users table:
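A sketch of that statement; the column name flags matches the queries later in this section:

```sql
ALTER TABLE users ADD COLUMN flags TINYINT UNSIGNED DEFAULT 0;
```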
Notice that we're setting the default value to 0 and making the column tinyint unsigned to store only positive integers.
After adding the column, we can store multiple flags in a single integer. To do this, we need to understand how bits and bytes work.
Bits and bytes
A byte is a unit of digital information that consists of eight bits. A bit is a binary digit that can be either 0 or 1. By manipulating the bits in an integer, we can store multiple flags in a single field.
For example, suppose we have the following eight flags:
- dark_mode
- super_admin
- notification_opt_in
- metered_billing
- rollout_chat
- experiment_blue
- log_verbose
- new_legal_disclaimer
We can assign each flag to a bit in the integer column as follows:
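One possible mapping; the values for dark_mode and rollout_chat are fixed by the arithmetic below, while the other assignments are illustrative:

- dark_mode: bit 1 (value 1)
- super_admin: bit 2 (value 2)
- notification_opt_in: bit 3 (value 4)
- metered_billing: bit 4 (value 8)
- rollout_chat: bit 5 (value 16)
- experiment_blue: bit 6 (value 32)
- log_verbose: bit 7 (value 64)
- new_legal_disclaimer: bit 8 (value 128)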
Using this mapping, we can store up to eight flags in a single byte. To store multiple flags, we need to turn on the corresponding bits.
For example, suppose a user has turned on the dark_mode and rollout_chat flags. To represent this, we need to turn on bits 1 and 5. Using binary notation, this would look like this: 00010001.
To convert this to an integer value, we can add up the decimal value of each bit that is turned on. In this case, that would be 1 + 16 = 17.
We can store this value, 17, in the flags column for this user.
Querying the flags
To query the flags, we can use bitwise operators. For example, to find all users who have the dark_mode flag turned on, we can use the following SQL statement:
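A sketch, using the bit value 1 for dark_mode from the mapping above:

```sql
SELECT * FROM users WHERE (flags & 1) = 1;
```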
In this statement, we use the bitwise AND operator (&) to compare the value of the flags column with the binary value of 1 (00000001), which represents the dark_mode flag. If the result of the bitwise AND operation is 1, it means that the dark_mode flag is turned on, and we include that user in the result set.
Similarly, to find all users who have both the dark_mode and rollout_chat flags turned on, we can use the following SQL statement:
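A sketch, combining the two bit values (1 + 16 = 17):

```sql
SELECT * FROM users WHERE (flags & 17) = 17;
```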
In this statement, we use the bitwise AND operator (&) to compare the value of the flags column with the binary value of 17 (00010001), which represents both the dark_mode and rollout_chat flags. If the result of the bitwise AND operation is 17, it means that both flags are turned on, and we include that user in the result set.
Tradeoffs
Using a tiny integer column to store multiple flags in a single field is an efficient way to save space in the database. However, it comes with some tradeoffs.
- Reduced readability: The binary representation of flags in the flags column is not human-readable. It can be difficult to interpret the meaning of the flags without consulting a mapping table.
- Limited number of flags: The number of flags that can be stored in a single byte is limited to eight. If we need to store more flags, we need to use a larger integer column or a JSON column.
- Application logic overhead: To use the bitwise operators to query the flags, we need to write additional application logic that maps the flags to their corresponding bit values. This can add complexity to the codebase.
Timestamps versus booleans
As a developer, one of the challenges you face regularly is deciding which data type to use in your database. While the choice often depends on the specifics of the project and its requirements, it is essential to weigh the advantages and disadvantages of each option. In this video, we'll explore using timestamps instead of booleans and the tradeoffs that come with them.
To better illustrate this, let's consider an example where we'll create a table named "posts." This table will have several columns, but for demonstration, we'll focus on the "is_archived" column, which indicates whether a post has been archived.
Initially, we set the "is_archived" column as a boolean, which is a useful data type. It allows us to store a true/false value in a single byte, making it a compact way to store binary state. However, using timestamps instead of booleans has its benefits too.
Benefit of using timestamps instead of booleans
When we change the data type to timestamps, we get access to an extra piece of information. Instead of just storing a boolean value, we get to store an archive timestamp, which tells us when the post was archived.
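For example, the column definition might change from a boolean to a nullable timestamp along these lines (a sketch, not the only way to do it):

```sql
ALTER TABLE posts
  DROP COLUMN is_archived,
  ADD COLUMN archived_at TIMESTAMP NULL DEFAULT NULL;
```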
With this in place, we can still use the archived_at column as if it is a boolean value with the following query:
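Something like:

```sql
SELECT * FROM posts WHERE archived_at IS NULL;
```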
This query returns all unarchived posts, which is equivalent to the boolean version of the query:
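```sql
SELECT * FROM posts WHERE is_archived = 0;
```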
However, the archived_at column provides us with more information, namely the time at which the post was archived, which could be useful in the long run.
The tradeoffs of using timestamps
While using timestamps provides us with extra information, it has a few tradeoffs that we must consider.
A timestamp column uses four times more storage space than a boolean column, which uses only one byte. As a result, when using timestamps, we need to be mindful of the amount of storage space we use. In a large database, the extra space usage can be significant but it usually doesn't make too much of a difference.
If we decide to put an index on the is_archived column, it is unlikely to be selective enough to benefit the query. Most posts won't be archived, so querying for posts that aren't archived will not use the index. This is another reason why using a slightly larger column doesn't matter that much, as it's unlikely to be indexed.
Claiming rows
In some situations, multiple workers may need to claim rows in a database, indicating that they are currently working on them and others should not interfere. This can be especially common with sets of unprocessed rows, such as imports or emails. While MySQL may not be the best queue driver for this type of scenario, it is certainly possible to use it to claim rows in a lightweight way.
Let's take a look at how to claim rows in a MySQL database and ensure that each row is only claimed once.
Setting up the imports table
We'll start by creating an imports table that will hold unprocessed rows. This table will have columns for a file name, owner, available, started at, and finished at timestamps.
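A sketch of that table; the column names and types here are assumptions based on the description above:

```sql
CREATE TABLE imports (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  filename VARCHAR(255) NOT NULL,
  owner INT UNSIGNED NOT NULL DEFAULT 0,
  available TINYINT NOT NULL DEFAULT 1,
  started_at TIMESTAMP NULL,
  finished_at TIMESTAMP NULL,
  INDEX available_owner (available, owner)
);
```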
Note that we have included a composite index on the available and owner columns in order to make our queries more efficient.
Selecting available rows
Our first step is to select any available rows for our workers to claim. We can do this by running the following query:
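A sketch; here "most recent" is approximated by ordering on the id column:

```sql
SELECT * FROM imports
WHERE available = 1
ORDER BY id DESC
LIMIT 1;
```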
This query retrieves the most recent unclaimed row from the table. However, it is important to note that this method is flawed, as it does not guarantee that another worker has not already claimed this row by the time our worker updates it.
Updating claimed rows
Instead of selecting a row and then updating it, we can issue a blind update that claims the row as soon as it is found. We can do this using the following query:
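Something along these lines:

```sql
UPDATE imports
SET owner = 32, available = 0
WHERE available = 1
LIMIT 1;
```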
In this example, we are claiming the row for worker #32 by setting the owner column to 32 and the available column to 0. On your application side, you would need to make sure each worker process has a unique id. We are also only claiming one row at a time with the LIMIT 1 clause.
Checking claimed rows
Once a row has been claimed, we can ensure that only the owner of the row is able to modify or process it. We can do this by selecting rows where the owner is equal to the worker ID.
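For example:

```sql
SELECT * FROM imports WHERE owner = 32;
```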
This query will only return rows where the owner is equal to 32, indicating that worker #32 is currently working on that row.
Summary tables
As businesses collect more and more data, it can become difficult to manage and process all of it in a timely manner. One optimization technique that can help reduce the burden on your database is using a summary table, also known as a roll-up table. This involves taking large sets of historic data and rolling it up into a smaller, more manageable summary table. In this video, we'll explore how to create a summary table and integrate it into your existing database.
By using a summary table, we can greatly reduce the amount of data processing required by our query. We've demonstrated how to filter and group large data sets, create a new summary table, and roll up data into it. We've also combined live and historical data using UNION and CTE statements. Taking the time to optimize your database in this way can result in faster queries and a more efficient system overall.
Getting started: filtering your data
Let's start by examining an example payments table. Our goal is to roll up old months and put them into a summary table. To begin, we need to filter down to just older months:
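Something like the following, using DATE_FORMAT to get the first day of the current month:

```sql
SELECT
  amount,
  YEAR(payment_date) AS year,
  MONTH(payment_date) AS month
FROM payments
WHERE payment_date < DATE_FORMAT(NOW(), '%Y-%m-01');
```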
Here, we're selecting the amount of the payment as well as the year and month that the payment was made. By limiting the results to payment_date values that are less than the first day of the current month, we ensure that we're only selecting data from previous months.
Grouping your data
Once we have our filtered data, we need to group it by year and month so we can roll it up into the summary table:
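Something like:

```sql
SELECT
  SUM(amount) AS amount,
  YEAR(payment_date) AS year,
  MONTH(payment_date) AS month
FROM payments
WHERE payment_date < DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY year, month;
```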
This gives us a sum of the amount for each month that is in our historical data set.
Creating your summary table
Now it's time to create the summary table itself. We define the schema using the following statement:
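A sketch; the table name and column types are assumptions:

```sql
CREATE TABLE payments_summary (
  year YEAR NOT NULL,
  month TINYINT UNSIGNED NOT NULL,
  amount DECIMAL(9, 2) NOT NULL,
  PRIMARY KEY (year, month)
);
```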
We'll use this table to store our rolled-up historical data. Note that we've defined the year column to use the YEAR data type, which is rarely used since it stores nothing but a four-digit year. In this case it is actually useful!
Populating your summary table
To insert the data from our filtered query into the summary table, we can use the following statement:
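Something like:

```sql
INSERT INTO payments_summary (amount, year, month)
SELECT
  SUM(amount) AS amount,
  YEAR(payment_date) AS year,
  MONTH(payment_date) AS month
FROM payments
WHERE payment_date < DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY year, month;
```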
This will populate the summary table with the historic data we've rolled up.
Combining your summary table with live data
The last step is to incorporate this summary table with live data. We want to ensure that the summary table is up-to-date, but we also want to be able to query both historic and current data together. Here's one way we can do it:
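A sketch of the combined query:

```sql
SELECT amount, year, month FROM payments_summary

UNION ALL

SELECT
  SUM(amount) AS amount,
  YEAR(payment_date) AS year,
  MONTH(payment_date) AS month
FROM payments
WHERE payment_date >= DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY year, month;
```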
This combines our rolled-up historical data with our current data. Note that we're using UNION ALL instead of UNION here, which preserves duplicate rows (of which there should be none in our case). We're using the same query as before, but with the opposite condition to ensure that we're only selecting data that's been added this month.
Using a common table expression
Finally, we can use a common table expression (CTE) to treat this entire query as a single table:
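Something like:

```sql
WITH payment_data AS (
  SELECT amount, year, month FROM payments_summary

  UNION ALL

  SELECT
    SUM(amount) AS amount,
    YEAR(payment_date) AS year,
    MONTH(payment_date) AS month
  FROM payments
  WHERE payment_date >= DATE_FORMAT(NOW(), '%Y-%m-01')
  GROUP BY year, month
)
SELECT * FROM payment_data ORDER BY year, month;
```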
Here, we wrap our combined query in a WITH statement to define a new CTE called payment_data. We can then select from this CTE as if it were a real table.
Meta tables
As your database grows, you may start to notice that certain tables are becoming increasingly wide, with more and more columns. While this may seem like a convenient way to store all of your data in one place, it can actually slow down your database performance. This is because when a table is very wide, each row consumes more disk space, which means the rows are not packed together as efficiently on disk. To combat this, we can use a technique called a "supplement table" or a "meta table", which involves shuffling less frequently used columns off into a separate table to keep each row short and improve disk access performance.
By shuffling less frequently used columns off into a separate table, you can keep your rows short and improve disk access times. While there are potential trade-offs, such as added complexity and the need for inner joins to reconstitute full rows, supplement tables can be a valuable tool for optimizing your database design.
What is a supplement table?
A supplement table, also known as a meta table or addendum table, is a secondary table that contains columns that aren't used very often. By breaking a very wide table into more focused groups of columns, you can optimize your database for faster read and write times. Of course, like any optimization, there are trade-offs. While supplement tables can help improve performance, they can also add complexity to your database design, making it harder to manage and update.
When to use supplement tables?
The decision to use supplement tables depends on the size and complexity of your database, as well as the specific needs of your application. A good rule of thumb is to consider using supplement tables when you have a very long table (lots of rows) that is also very wide (lots of columns), and there are many columns that are not used very often.
How to create a supplement table
Creating a supplement table is fairly straightforward. First, identify which columns are frequently used (i.e. the "hot" set of columns) and which columns are not. Then, split the table: keep the "hot" column set in a narrow main table and move the less frequently used columns into the supplement table. You can then join the two tables as needed using the ID column as a key.
For example, if you have a film catalog with a wide film table, you could create a film_narrow table containing only the title, rental_rate, and rating columns as the "hot" set of columns. The other columns, such as the description, could be moved to a film_addendum table. You can then join the two tables as needed using an inner join on the ID column.
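A sketch of reconstituting a full row, assuming both tables share a film_id key:

```sql
SELECT film_narrow.*, film_addendum.description
FROM film_narrow
INNER JOIN film_addendum USING (film_id);
```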
Potential issues with supplement tables
While supplement tables can improve database performance, they can also add complexity to your application design. You would have to change your queries on your application side to ensure you got back all the columns you needed. Additionally, when issuing updates you'd need to make sure you target the correct table.
Despite these potential issues, supplement tables can be an effective way to optimize your database and improve performance. By keeping your rows short and your hot column set focused, you can reduce your disk I/O and speed up your database access times.
Offset limit pagination
Pagination is necessary when you have a vast amount of records to show, but you do not want to overwhelm the user with all the data at once. Instead, you break the records into smaller, more manageable chunks called "pages." In this video, we will examine one of two methods used for pagination: limit/offset pagination.
Limit/offset is easy to use and ideal for directly addressable pages, but it is less well suited for records that frequently shift around.
No matter which method you choose, deterministic ordering for your records is critical, and it ensures that your users receive consistent, accurate data.
Deterministic ordering is key
Before diving into the different methods of pagination, it is important to emphasize the significance of ensuring deterministic ordering. If your ordering is not deterministic, your records may not show up correctly, which can lead to confusion for your users. For pagination to be effective, you must order your records in a stable and deterministic manner.
The limit offset method
Let's start by examining the limit offset method. With this method, the first step is to select records from the database. You should then order the records by a field (or multiple fields!) to ensure deterministic ordering. The next step is to specify the page size by using the LIMIT keyword followed by the number of records you want to show per page. Finally, you specify the offset by using the OFFSET keyword followed by the starting position for the current page.
Here's an example:
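```sql
SELECT * FROM people
ORDER BY birthday   -- add a unique column (e.g. id) here if birthday alone isn't deterministic
LIMIT 100
OFFSET 0;
```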
In this example, we are selecting all records from the people table, ordering by the birthday column, and showing 100 records per page. The OFFSET is set to 0 because we are on the first page.
Strengths of limit/offset method
One significant advantage of the limit offset method is that it is user-friendly and easy to implement. You can create a simple query to construct the pagination, and users can jump directly to the page they want to view by using their desired offset number. This method provides directly addressable pages, so if users want to jump to a certain page, they can do so quickly and easily.
Drawbacks of limit/offset method
One significant limitation of limit offset is that the page numbers can drift as you navigate through the records. For example, if a record is deleted from your current page, it may cause a record to shift from the next page to the current page, leading to confusion. Moreover, as you navigate deeper into the records, the method becomes significantly more expensive, meaning the database has to do more work to fetch that specific page.
Checking whether another page exists
A simple solution here is to ask for one extra record by querying with LIMIT page_size + 1; if that extra row comes back, you know another page exists.
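For example, when showing 100-row pages, requesting 101 rows tells us whether the current page has a successor (the offset value is illustrative):

```sql
SELECT * FROM people
ORDER BY birthday
LIMIT 101
OFFSET 200;
```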
Cursor pagination
Best for infinite scroll.
When it comes to pagination, developers often turn to either offset limit or cursor-based pagination. While offset limit can be simple to implement, it has a few drawbacks that make it less than ideal for certain situations. Cursor-based pagination aims to address some of these flaws but comes with its own set of challenges.
Before diving deeper into cursor-based pagination, it's important to note that we're not talking about database cursors. While database cursors are interesting, they're not what we're discussing here. Cursor-based pagination references a method of pagination that is different from offset limit.
What is cursor-based pagination?
Cursor-based pagination allows for the ability to efficiently retrieve large datasets from a database by breaking them down into smaller pages. This method is particularly useful when working with large datasets where loading all the data at once would be impractical or slow.
When implementing cursor-based pagination, developers need to keep track of the last record that the user saw. To accomplish this, a "cursor" is sent out to the front-end with each page of results. The cursor then comes back to the database as a token, indicating where the next page of results should start.
Pros and cons of cursor-based pagination
Benefits
One of the advantages of cursor-based pagination is its resilience to shifting rows. For example, if a record is deleted, the next record that would have followed is still displayed since the query is working off of the cursor rather than a specific offset.
Another benefit of cursor-based pagination is that it can work well with infinite scroll, a design trend that loads content as the user scrolls.
Drawbacks
One of the primary downsides of cursor-based pagination is that it's impossible to directly address a specific page. For instance, if the requirement is to jump directly to page five, it's not possible to do so since the pages themselves are not explicitly numbered.
Additionally, cursor-based pagination can be more complicated to implement than offset limit pagination. More thought needs to be put into the structure of the cursor and what criteria should be used to determine it.
How to implement cursor-based pagination
Cursor-based pagination requires maintaining state and intelligently reconstructing that state on demand. To achieve this, developers need to:
- Order the query results for the current page using a unique identifier like an ID or a combination of multiple fields.
- Calculate the "cursor" based on the last record.
- Send the results and the cursor back to the user.
- On the next request, extract the cursor, deconstruct the state from it, and construct the query.
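Putting these steps together, a minimal sketch of a next-page query, assuming id is the cursor and 700 is a placeholder for the last id the user saw:

```sql
SELECT * FROM people
WHERE id > 700
ORDER BY id
LIMIT 100;
```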
Deferred joins
An optimization solution for deeper pagination
Pagination is essential for any website or application that displays a large amount of data. It allows users to navigate the content more easily and intuitively. However, traditional offset limit based pagination can become slower as users go deeper into the results. This is where the deferred join technique comes in handy.
The deferred join technique is an excellent solution for websites or applications that require deep pagination. It's more efficient than traditional pagination and can be optimized further with indexes.
Although there are libraries available for popular web frameworks, it's still useful to know how to implement it manually. By understanding the principles behind the technique, you'll be able to apply it in other contexts where libraries aren't available.
What is the deferred join technique?
The deferred join technique is an optimization solution that enables pagination in a more efficient way. It works by performing the pagination on a subset of the data, instead of the entire table. This subset is generated by a subquery, which is joined with the original table later on. The technique is called "deferred" because the join operation is postponed until after the pagination is done.
This technique has been widely adopted, and there are libraries available for popular web frameworks such as Rails ([FastPage](https://planetscale.com/blog/fastpage-faster-offset-pagination-for-rails-apps)) and Laravel ([Fast Paginate](https://github.com/aarondfrancis/fast-paginate)). However, in this post, we'll show you how to implement it manually, step by step.
The problem with traditional pagination
Let's start with an example of traditional pagination. Suppose we have a table called "people," and we want to display the results sorted by birthday. We'll limit the results to 20 records and offset them by 450,000.
Here's what the query would look like:
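```sql
SELECT * FROM people
ORDER BY birthday
LIMIT 20
OFFSET 450000;
```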
When we run this query, it takes around 600-700 milliseconds to execute. As the offset increases, the query becomes slower and slower.
Implementing the deferred join technique
Now, let's implement the deferred join technique and see how it performs.
The first step is to create a subquery that generates the subset of data that we'll use for pagination. The subquery should have a simple SELECT statement that retrieves only the ID column:
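Something like:

```sql
SELECT id FROM people
ORDER BY birthday
LIMIT 20
OFFSET 450000;
```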
Our main query will then join this subquery with the original table using the ID column as a matching criterion:
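A sketch; the derived-table alias is arbitrary:

```sql
SELECT * FROM people
INNER JOIN (
  SELECT id FROM people
  ORDER BY birthday
  LIMIT 20
  OFFSET 450000
) AS tmp USING (id)
ORDER BY birthday;
```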
Here, we're using the USING clause to specify the matching column between the main table and the subquery. We're also sorting the results by birthday in ascending order.
If we run this query, we'll see that it takes only 200 milliseconds to execute. That's around three times faster than the traditional pagination!
Why does the deferred join technique work?
The deferred join technique may seem counterintuitive at first, but it's actually quite simple. By generating a subset of data that contains only the ID column, we're able to apply the pagination on a much smaller dataset. This means we're throwing away less work.
When we join this subset with the main table, we're able to retrieve only the rows that match our pagination criteria. The join operation is more efficient than filtering the whole table with a large offset.
Taking the optimization further with indexes
To further optimize the deferred join technique, we can use indexes. If we add an index to the birthday column, for example, we can make the inner subquery use a covering index. This means that our pagination can be done entirely on the index without retrieving the actual data.
Here's how we can add an index on the birthday column:
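```sql
ALTER TABLE people ADD INDEX birthday (birthday);
```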
When we run our optimized query again, we'll see that it takes only 60 milliseconds to execute! That's a full ten times faster than the traditional pagination.
Geographic searches
MySQL provides powerful geometry and geographic functions that are easy to use and helpful for many applications. Using a simple latitude and longitude column in combination with efficient query optimization techniques such as approximate and redundant conditions, we can do fast and accurate geographical searching.
Please note that for more advanced scenarios, points columns, geometry columns, and other specialized columns are available in MySQL. However, for basic geographical searching, the latitude and longitude columns method is a lightweight and straightforward alternative to these complex features.
Basic geographical searching
MySQL has a special data type for a point column, but for this article, we will be using a latitude and longitude column. This method is straightforward and easy to understand but please note that there are more advanced methods for advanced use cases.
Let's start by taking a closer look at these latitude and longitude columns. Suppose we have a table with one million addresses, each of which has a latitude and longitude column. To search for these addresses based on distance, we can use the ST_Distance_Sphere function, which calculates the distance between two points on a sphere.
For example, the following query gives us the distance between two points on the sphere. It is important to note that the calculation is in meters, and we can use this function to make a simple comparison, such as whether the distance is less than a specified value.
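A sketch, assuming a table named addresses with longitude and latitude columns (the hardcoded coordinates are placeholders):

```sql
SELECT ST_Distance_Sphere(
  POINT(-95.4613, 29.7804),    -- hardcoded point, as (longitude, latitude)
  POINT(longitude, latitude)   -- each row's coordinates
) AS distance
FROM addresses;
```

Moving that calculation into the WHERE clause turns it into a search:

```sql
SELECT *
FROM addresses
WHERE ST_Distance_Sphere(
  POINT(-95.4613, 29.7804),
  POINT(longitude, latitude)
) < 1609;
```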
In this query, we are searching for all points within one mile of the given point, which is defined by a hardcoded lat-long value.
The query calculates the distance between the hardcoded lat-long value and every row in the table. If the distance is less than 1609 meters (one mile), it is included in the results.
Running this query with a million rows can become slow because the query calculates the distance for every row, even if they are farther away than one mile. Therefore, we need to make an approximate condition that acts as an initial filter and eliminate the false positives later.
Approximate and redundant conditions
Making an approximate condition to quickly filter out rows that are far away from the given point is a popular technique to optimize geographic searching. As mentioned earlier, calculating the distance between every row and the given point can be a time-consuming process. Therefore, we can use a bounding box and filter the rows that fall within that box.
Filtering rows with bounding box
The first step is to calculate the actual bounding box based on the location of the given point. We can use off-the-shelf functions to do this; the returned latitude range and longitude range define the box.
Next, we add an index to our table to cover either latitude or longitude — which will work equally well. MySQL will use the key for the bounding box filter, and after getting a small subset of rows, it will run the expensive ST_Distance_Sphere function.
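A sketch of the combined query, assuming an index on latitude; the bounding-box ranges below are illustrative values roughly one mile around the hardcoded point:

```sql
ALTER TABLE addresses ADD INDEX latitude (latitude);

SELECT *
FROM addresses
WHERE latitude BETWEEN 29.7659 AND 29.7949       -- approximate condition, satisfied by the index
  AND longitude BETWEEN -95.4780 AND -95.4446    -- approximate condition
  AND ST_Distance_Sphere(                        -- exact condition, run only on the small subset
        POINT(-95.4613, 29.7804),
        POINT(longitude, latitude)
      ) < 1609;
```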
We can confirm that the index works faster by checking the query's execution time, and we should see a significant improvement. Approximate and redundant conditions with indexing can vastly increase the speed of otherwise slow operations.