Queries

Introduction to queries


EXPLAIN overview

The EXPLAIN statement is a powerful tool that developers can use to analyze and optimize MySQL queries. It provides valuable information on how the MySQL engine accesses the table data, and can help you identify issues with slow query performance.

When we run an EXPLAIN statement, MySQL provides us with an overview of how the query is going to be executed, including the type of access it uses to reach the data, the index used, and the estimated number of rows accessed. With this information, we can determine which queries need optimization to improve performance.

    EXPLAIN SELECT * FROM people;

This statement returns output that shows how MySQL accessed the table:

  • ID: A unique identifier for the query being executed.

  • Select Type: Tells us the type of SELECT statement being executed. This can be simple, primary, union, or a few others.

  • Table: The name of the table being accessed.

  • Partitions: Displays the partitions being accessed for the query (beyond the scope of this course).

  • Type: The kind of access MySQL used to retrieve the data. This is one of the most important column values.

  • Possible Keys: The possible indexes that MySQL could use.

  • Key: The actual index that MySQL uses.

  • Key Length: Displays the length of the index used by MySQL.

  • Ref: The value being compared to the index.

  • Rows: An estimated number of rows that MySQL needs to examine to return the result.

  • Filtered: The estimated percentage of rows that match the query criteria.

EXPLAIN -> TYPE

Using explain types, you can gain insight into your query performance and identify areas for improvement. If you see const or ref access methods, you likely have a well-structured database that performs well. If you see index or all access methods, it might be time to investigate ways to optimize your database structure or indexing strategy.

Const

The const access method is one of the most efficient. Const access is only used when a primary key or unique index is in place, allowing MySQL to locate the necessary row with a single operation. When you see const in the type column, it's telling you that MySQL knows there is only one match for this query, making the operation as efficient as possible.
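
For example, looking up a single row by its primary key (assuming id is the primary key of the people table) should show const in the type column:

EXPLAIN SELECT * FROM people WHERE id = 42;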

Ref

The ref access method is slightly less efficient than const, but still an excellent choice if the right index is in place. Ref access is used when the query includes an indexed column that is being matched by an equality operator. If MySQL can locate the necessary rows based on the index, it can avoid scanning the entire table, speeding up the query considerably.
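
As a sketch, assuming a non-unique index exists on the last_name column, an equality match against it should show ref:

EXPLAIN SELECT * FROM people WHERE last_name = 'Francis';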

Fulltext

MySQL provides an option to create full-text indexes on columns intended for text-based search queries. The fulltext access method is used when a full-text index is in place and the query includes a full-text search. Fulltext access allows MySQL to search the index and return the results quickly.
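
A minimal sketch, assuming a hypothetical bio text column on the people table: first create the full-text index, then search it with MATCH ... AGAINST:

ALTER TABLE people ADD FULLTEXT INDEX ft_bio (bio);

EXPLAIN SELECT * FROM people WHERE MATCH(bio) AGAINST('engineer');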

Range

When your where clause contains a range condition (such as BETWEEN, <, >, or IN), MySQL knows that it will need to look through a range of values to find the right data. MySQL will use the B-Tree index to traverse from the top of the tree down to the first value of the range. From there, MySQL consults the linked list at the bottom of the tree to find the rows with values in the desired range. It's essential to note that MySQL will examine every element in the range until a mismatch is found, so this can be slower than some of the other methods mentioned so far.
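
For example, a range condition on the primary key should show range in the type column:

EXPLAIN SELECT * FROM people WHERE id BETWEEN 1 AND 100;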

Index

The index access method indicates that MySQL is scanning the entire index to locate the necessary data. Index access is the slowest access method listed so far, but it is still faster than scanning the entire table. When MySQL cannot use a primary or unique index, it will use index access if an index is available.
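
As a sketch, assuming an index exists on the birthday column, a query that can be answered entirely from that index but has no condition to narrow it down may scan the whole index and show index:

EXPLAIN SELECT birthday FROM people;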

All

The all access method means that MySQL is scanning the entire table to locate the necessary data. All is the slowest and least efficient access method, so it's one that you want to avoid as much as possible. MySQL may choose to scan the entire table when there is no suitable index, so this is an excellent opportunity to audit your indexing strategy.

Different formats of explain

The explain statement provides several formats that you can use to analyze your queries in more detail. Some of the commonly used explain formats are tree, JSON, and EXPLAIN ANALYZE.

Explain tree format

The tree format is useful for providing more detail into the execution plan in a nested tree structure. You can specify this format by adding format=tree directly after the explain keyword.

explain format=tree select * from people where first_name = "Aaron"

We're provided with the following output, which estimates the cost of each part of the query. (This query only has one part, so it doesn't look too much like a tree!)

-> Index lookup on people using multi (first_name='Aaron')  (cost=198.00 rows=180)

Explain JSON format

The JSON format provides a more detailed view of the same information as provided in the tree format. You can specify this format by adding format=json directly after the explain keyword.

In the JSON format, MySQL provides information about the query's index usage, key parts, and query cost in a machine-readable, JSON format.
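
For example, mirroring the earlier query:

explain format=json select * from people where first_name = "Aaron"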

Explain analyze format

EXPLAIN ANALYZE actually runs the query and provides detailed statistics on the query's execution plan. It's important to note that this format actually runs the query, so it should be used with caution.

If you use the EXPLAIN ANALYZE format, MySQL will provide the detailed statistics that include the actual execution time and number of records read. If you're working with queries involving joins or subqueries, these statistics can be invaluable for understanding where the query is spending its time.
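
For example (remember that this actually executes the query):

EXPLAIN ANALYZE SELECT * FROM people WHERE first_name = 'Aaron';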

Index obfuscation

Efficiently querying your data: best practices

As data sets grow in size and complexity, it becomes essential to query that data efficiently. One of the most crucial ways to achieve this is by using indexes. Indexes help to optimize data retrieval, and while they are derived from your access patterns, it is possible to obfuscate them.

Index obfuscation

Imagine you have a table full of movies, and you want to filter out movies that are under two hours long. The length of each movie is in minutes, so you need to divide the length column by 60 to convert it to hours before running your query. However, by doing this, you've obfuscated your column and made it difficult for MySQL to use your index.

To understand this better, let's look at how to avoid index obfuscation. First, always leave your column alone as much as possible! Any changes you make to it, such as dividing or combining it with other columns, make it more difficult to use an index effectively. Instead, move everything to the other side of the operator when possible.

In our case, instead of dividing the length column by 60, we should multiply the two hours by 60 on the other side of the operator. By doing this, we leave the length column untouched and allow the MySQL engine to calculate the constant once and compare it against the index.

For example, instead of this:

    SELECT * FROM film WHERE length / 60 < 2;

We should write it like this:

    SELECT * FROM film WHERE length < 2 * 60;

This simple change can lead to a significant improvement in query performance because it allows MySQL to potentially utilize an index on the length column.

Redundant and approximate conditions

Unlocking indexes with redundant and approximate conditions in MySQL

When it comes to database optimization, finding ways to improve query performance is crucial. One approach to unlocking indexes without making changes to the table structure is to use redundant and approximate conditions. Although not a common occurrence, these types of conditions have the potential to make a significant impact on performance once identified. In this section, we'll explore what redundant and approximate conditions are, why they're valuable, and how to use them to improve query performance.

Understanding redundant conditions

Redundant conditions refer to query conditions that logically cannot change the result set. Even though they don't produce any changes in the result, they're still valuable because they can help optimize the query. For instance, consider the following example of querying a people table:

    SELECT * FROM people WHERE id <= 5

This query returns the first five rows from the people table based on the given condition. However, if we modify the query as shown below, we still get the same result.

SELECT * FROM people
  WHERE
    id <= 5
    AND
    id <= 10

The second condition is redundant because it doesn't change the result of the query at all. The key benefit of redundant conditions lies in the fact that they help unlock indexes without any changes to the table. So, it's worth looking out for these types of conditions when optimizing your database. In this example we're not getting any benefit from the redundant condition, but there are times when we can get huge benefits.

Redundant condition example

Consider a scenario where you have a to-do list and want to find all the to-do's that are due in the next 24 hours. You may have an index on the due date, but not the due date combined with due time. In this case, you can make use of the redundant condition, which logically cannot change the results and can help you use the available index more efficiently.

Query without the redundant condition:

SELECT * FROM todos
  WHERE
  ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY

The above query retrieves all the to-do's that are due within the next 24 hours based on the due_date and due_time columns. Unfortunately we've obfuscated the index on the due_date column.

In this case we can add a redundant condition that is broader than our original condition, but is not index-obfuscated.

SELECT * FROM todos
  WHERE
  ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
  AND
  due_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL 1 DAY

Using the approximate condition helps to narrow down the result set even though it may include a few false positives. After narrowing down the result set, you can then use the expensive filtering, which is not indexable, to remove the false positives and get the correct result set.

Select only what you need

The "select only what you need" principle is a best practice for a reason! It allows you to retrieve only the data that you require, which in turn leads to better performance and memory usage. However, at the application layer, you must be careful not to access columns that you didn't retrieve. You can use "safety mechanisms" in most frameworks to overcome this problem.

By being mindful of these principles and leveraging the solutions provided, you can use "select only what you need" efficiently and safely in your database queries in the future.

Best practices for querying: select only what you need

The importance of only selecting what you need

Let's start with why you should only select what you need. Suppose you run a SELECT * statement on a database table. In that case, you’ll retrieve all the data in that table, including every row and column. Retrieving all this data can lead to significant performance issues, especially if you're dealing with large columns like JSON or TEXT.

There's no need to retrieve all the data if you're not going to use it! It's essential only to pull back the columns you plan on actually using. Doing this will reduce network utilization, disk access, and memory usage.
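
For example, rather than selecting every column, name just the ones the application will use (the column names here are illustrative):

-- Instead of this:
SELECT * FROM people;

-- Select only what you need:
SELECT id, first_name, last_name, email FROM people;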

The Active Record exception

One caveat to the "select only what you need" principle is when you are partially populating an Active Record style model.

Suppose your framework issues a SELECT * statement that then populates an Active Record type model. It will assign every column to a property in the model. You might want to select only a particular set of columns, but in doing so, you might miss essential columns unintentionally. If your framework doesn't protect you from accessing the properties that weren't selected, you might be dealing with a partially populated model and not be aware of it!

The good news is that most frameworks have protections against this problem. You can turn on "safety mechanisms" that prevent you from accessing attributes that haven't been pulled from the database. If your framework has this protection, you can limit your select statements and feel safe that you won't accidentally access any column that you didn't retrieve.

Limiting rows

By using indexes and limiting the number of rows returned, we can greatly optimize our queries and improve the performance of our relational databases. We should always strive to only request the data that we need and minimize the amount of unnecessary data that we send back and forth between the database and our application.

Limiting the number of rows returned in MySQL

As we continue our discussion of efficient querying, we will now focus on the importance of limiting the number of rows returned in a query. The principle remains the same: do not bring back data that you do not need and will simply throw away.

Using indexes to eliminate data

Indexes play a crucial role in optimizing queries in a relational database. When querying a table, we want our indexes to narrow down to just the data that we need so that we can minimize the amount of data that we request from the actual table. Sometimes, even when using indexes, we may still need to do some additional elimination on a column that was not indexed!

Only return necessary rows

When querying a table, we should only request the rows that we need instead of requesting all of the rows and then discarding most of them. Here are a few scenarios where we should only return the necessary rows:

Counting rows in a table

If we want to count the number of rows in a table, we should not select all of the data and send it back to our application. Instead, we should push this down to the database and have the database do the counting. This way, the database does not need to access every row in the table and send it back over the wire. To count rows, we can use the following:

SELECT COUNT(*) FROM table_name;

Minimums, maximums, averages

Calculations such as minimums, maximums, and averages should be done in the database instead of in our application. By doing these calculations in the database, we eliminate the need to look at all of the rows and send them back. To find the minimum in a column, we can use the following:

SELECT MIN(column_name) FROM table_name;

Likewise, to find the maximum in a column we can use SELECT MAX(column_name) FROM table_name, and to find the average we can use SELECT AVG(column_name) FROM table_name.

Distinct lists

If we want to get a distinct list of values for a particular column, we should use SELECT DISTINCT column_name FROM table_name. By doing this in the database, we can avoid sending back unnecessary rows to our application. However, if we want a separate distinct list for each of several columns, we will need to issue multiple queries.

Pagination

Pagination refers to the technique of only returning a specific page of results that our users might be interested in. For example, if we return 500,000 results but only show 10 to our users, we have wasted a lot of processing. Thus, it is important to limit the number of rows returned and only return as many rows as will be useful.

When using pagination queries such as LIMIT 10 OFFSET 20, it is important to note that there are specific rules and best practices to follow. We should always include an ORDER BY clause when using LIMIT, even if it is simply ORDER BY id. Otherwise, MySQL gets to decide how to order the rows, which can cause inconsistencies. We will explore how to perform pagination in an index-assisted way to minimize the amount of data that we request from the actual table.
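
As a simple sketch, a deterministic query for the third page of 10 results might look like this:

SELECT id, first_name, last_name
FROM people
ORDER BY id
LIMIT 10 OFFSET 20;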

An overview of joins

Inner joins return only rows that have matches in both tables, left joins return all rows from the left table and any matching rows from the right table, and right joins return all rows from the right table and any matching rows from the left table. Full outer joins are not available in MySQL, but you can approximate them if necessary.

If you're working with large tables, it's important to index your data properly to avoid long processing times when joining tables together.

Setting up the basic tables

We'll use the store and staff tables as examples. Here's how we'll query them:

SELECT * FROM store;

| id | manager_staff_id | name     | address_id | last_update         |
|----|------------------|----------|------------|---------------------|
|  1 |                1 | Downtown |          1 | 2023-02-10 17:39:41 |
|  2 |                2 | Uptown   |          2 | 2023-02-10 17:39:41 |
|  3 |                  | Kiosk    |          3 | 2023-02-10 19:17:42 |

This returns three stores: a downtown store, an uptown store, and a kiosk.

SELECT * FROM staff;

| id | first_name | last_name | store_id | last_update         |
|----|------------|-----------|----------|---------------------|
|  1 | Mike       | Hillyer   |        1 | 2023-02-15 03:57:16 |
|  2 | Jon        | Stephens  |        2 | 2023-02-15 03:57:16 |
|  3 | Aaron      | Francis   |        1 | 2023-02-10 19:17:50 |

This returns three staff people: Mike, Jon, and Aaron.

Looking at the store table, we see that there is a manager_staff_id column. This column tells us which staff person is the manager of the store. The downtown store is managed by employee number one, the uptown store by employee number two, and the kiosk has no manager.

Inner joins

Let's start with an inner join. An inner join takes the left table and the right table and matches them up together based on the criteria you specify. It only returns results that have a link in both tables.

SELECT * FROM store -- left table
  INNER JOIN staff -- right table
  ON store.manager_staff_id = staff.id;

| id | manager_staff_id | name     | last_update         | id | first_name | last_name | store_id | active | username | last_update         |
|----|------------------|----------|---------------------|----|------------|-----------|----------|--------|----------|---------------------|
|  1 |                1 | Downtown | 2023-02-10 17:39:41 |  1 | Mike       | Hillyer   |        1 |      1 | Mike     | 2023-02-15 03:57:16 |
|  2 |                2 | Uptown   | 2023-02-10 17:39:41 |  2 | Jon        | Stephens  |        2 |      1 | Jon      | 2023-02-15 03:57:16 |

This query does not return the kiosk because it has no manager, and it doesn't return Aaron because they're not a manager. It only returns the downtown and uptown stores, and their respective managers.

Left joins

Now let's try a left join. A left join returns all the records from the left table, and any matching records from the right table.

SELECT * FROM store -- left table
  LEFT JOIN staff -- right table
  ON store.manager_staff_id = staff.id;

| id | manager_staff_id | name     | last_update         | id | first_name | last_name | store_id | active | username | last_update         |
|----|------------------|----------|---------------------|----|------------|-----------|----------|--------|----------|---------------------|
|  1 |                1 | Downtown | 2023-02-10 17:39:41 |  1 | Mike       | Hillyer   |        1 |      1 | Mike     | 2023-02-15 03:57:16 |
|  2 |                2 | Uptown   | 2023-02-10 17:39:41 |  2 | Jon        | Stephens  |        2 |      1 | Jon      | 2023-02-15 03:57:16 |
|  3 |                  | Kiosk    | 2023-02-10 19:17:42 |    |            |           |          |        |          |                     |

This query returns all the stores, including the kiosk (which has no manager), and the downtown and uptown stores with their respective managers.

Right joins

What if we want to see all the employees, whether or not they are managers? That's where a right join comes in.

SELECT * FROM store -- left table
  RIGHT JOIN staff -- right table
  ON store.manager_staff_id = staff.id;

| id | manager_staff_id | name     | last_update         | id | first_name | last_name | store_id | active | username | last_update         |
|----|------------------|----------|---------------------|----|------------|-----------|----------|--------|----------|---------------------|
|  1 |                1 | Downtown | 2023-02-10 17:39:41 |  1 | Mike       | Hillyer   |        1 |      1 | Mike     | 2023-02-15 03:57:16 |
|  2 |                2 | Uptown   | 2023-02-10 17:39:41 |  2 | Jon        | Stephens  |        2 |      1 | Jon      | 2023-02-15 03:57:16 |
|    |                  |          |                     |  3 | Aaron      | Francis   |        1 |      1 | Aaron    | 2023-02-10 19:17:50 |

This query returns all the employees, including Aaron (who is not a manager), along with the downtown and uptown store managers. However, it only returns two store rows; the kiosk row is missing because it has no manager.

Full outer joins

Some databases have a full outer join, which returns all rows from both tables, whether or not there's a match. MySQL doesn't have this feature. You can approximate it by using a left join and a union all, and then adding a right join, but this can lead to duplicates and may not be efficient for large tables.
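
As a rough sketch using the store and staff tables from above, one common emulation unions a left join with a right join that keeps only the staff rows the left join missed:

SELECT * FROM store
  LEFT JOIN staff ON store.manager_staff_id = staff.id

UNION ALL

SELECT * FROM store
  RIGHT JOIN staff ON store.manager_staff_id = staff.id
  WHERE store.id IS NULL;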

Indexing joins

Indexing is critical when it comes to joins in MySQL. By properly indexing the related columns between tables, you can significantly improve the performance of your queries. Always use explain to analyze query execution plans and optimize your queries for performance.

How does MySQL join tables?

When MySQL joins tables together, it needs to figure out which rows from one table match which rows from the other table. One way to do this is by doing a full table scan, which is slow and inefficient. The better way is to use an index on the related columns, which allows MySQL to quickly retrieve the matching rows and combine them.

Many-to-many relationships

Sometimes you'll come across a many-to-many relationship between tables. For example, in a film database, an actor can be in many movies and a movie can have many actors. To link these two entities, you need a joining table, typically with a composite primary key.

Let's take a look at an example. We have a film table full of movies and an actor table full of actors. To get a list of all the movies and the actors that were in those movies, we need to use the film_actor table as the joining table. This table has a composite primary key made up of actor_id and film_id.
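
A minimal sketch of what such a joining table might look like (the column types here are assumed for illustration):

CREATE TABLE film_actor (
  actor_id INT UNSIGNED NOT NULL,
  film_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (actor_id, film_id) -- composite primary key
);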

Example query

To illustrate the impact of indexing, let's run some queries with and without indexes. We'll use the explain statement to see the query execution plan and the cost of each query. Let's say we want to get a list of the first 10 films and all the actors that were in those movies. Here's the query we'll use:

SELECT
  film.title, actor.first_name, actor.last_name
FROM film
  LEFT JOIN film_actor ON film_actor.film_id = film.id
  LEFT JOIN actor ON actor.id = film_actor.actor_id
WHERE
  film.id <= 10
LIMIT 10;

We'll start by adding an index to the joining table (film_actor) on the film_id column:

ALTER TABLE film_actor ADD INDEX idx_fk_film_id (film_id);

Query with indexes

SELECT
  film.title, actor.first_name, actor.last_name
FROM film
  LEFT JOIN film_actor ON film_actor.film_id = film.id
  LEFT JOIN actor ON actor.id = film_actor.actor_id
WHERE
  film.id <= 10

Using the explain format=tree statement, we can see that MySQL is using the added indexes to join the tables and the cost of the query is only 29 cost units.

-> Nested loop left join  (cost=29.42 rows=55)
    -> Nested loop left join  (cost=10.25 rows=55)
        -> Filter: (film.id <= 10)  (cost=2.82 rows=10)
            -> Index range scan on film using PRIMARY  (cost=2.82 rows=10)
        -> Covering index lookup on film_actor using idx_fk_film_id (film_id=film.id)  (cost=1.06 rows=5)
    -> Single-row index lookup on actor using PRIMARY (id=film_actor.actor_id)  (cost=0.84 rows=1)

Query without indexes

Turn the index off (by making it invisible) and run the same query again:

ALTER TABLE film_actor ALTER INDEX idx_fk_film_id INVISIBLE;

SELECT
  film.title, actor.first_name, actor.last_name
FROM film
  LEFT JOIN film_actor ON film_actor.film_id = film.id
  LEFT JOIN actor ON actor.id = film_actor.actor_id
WHERE
  film.id <= 10

This time, using the explain statement, we can see that MySQL is doing a full table scan and the cost of the query is now 12,000 units.

-> Nested loop left join  (cost=12347.65 rows=54620)
    -> Left hash join (film_actor.film_id = film.id)  (cost=5519.98 rows=54620)
        -> Filter: (film.id <= 10)  (cost=3.02 rows=10)
            -> Index range scan on film using PRIMARY  (cost=3.02 rows=10)
        -> Hash
            -> Index scan on film_actor using PRIMARY  (cost=54.93 rows=5462)
    -> Single-row index lookup on actor using PRIMARY (id=film_actor.actor_id)  (cost=0.08 rows=1)

Subqueries

Subqueries in MySQL are a powerful tool for filtering data efficiently. MySQL can optimize your subqueries using the semi-join and anti-join optimization techniques. It's important to optimize your queries and test them with your dataset to ensure they perform well under various conditions!

If you want to learn more about subqueries in MySQL, check out Chapter 8 on optimizations in the MySQL documentation. This resource is a treasure trove of information that can help you become an expert in MySQL.

Exploring subqueries in MySQL

Let's take a look at an example to better understand subqueries. Suppose you are working with customer data and payments data, and you need to find a list of customers who have made payments greater than $5.99. To achieve this, you can join the two tables and filter down to customers who have spent more than $5.99. However, after trimming down the data, you might find duplicate customer records in the results. Here is the code to join customer and payment tables:

SELECT * FROM
  customer
  INNER JOIN payment ON customer.id = payment.customer_id
WHERE
  payment.amount > 5.99;

To remove the duplicate records, you can use a DISTINCT statement as follows:

SELECT
  DISTINCT customer.first_name, customer.last_name
FROM
  customer
  INNER JOIN payment ON customer.id = payment.customer_id
WHERE
  payment.amount > 5.99;

This method works, but it can be a pain to use DISTINCT, and you might end up creating a lot of data just to remove it.

Using subqueries in MySQL

Instead of a join, we will use a subquery in this case. Here is the code:

SELECT * FROM customer
WHERE
  id IN (
    SELECT customer_id FROM payment WHERE amount > 5.99
  );

The subquery returns only the customer IDs who match this criterion of having spent more than 5.99, and the outer query returns all of the corresponding customer data for the IDs that match the inner query.

One advantage of using subqueries is that you don't have to join all the data together and perform a DISTINCT operation after trimming it down. This approach can also improve the overall performance of your queries, as MySQL can use a semi-join optimization technique to process the data more efficiently.
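
The same shape works in reverse. As a sketch, to find customers who have not made a payment over $5.99, you can flip the condition into an anti-join style subquery (this assumes payment.customer_id is never null; otherwise prefer NOT EXISTS):

SELECT * FROM customer
WHERE
  id NOT IN (
    SELECT customer_id FROM payment WHERE amount > 5.99
  );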

Common table expressions (CTEs)

Common Table Expressions offer a powerful tool for composing complex queries in MySQL. By breaking down queries into smaller, more reusable parts, you can make your code more readable and maintainable. Furthermore, by using CTEs, you can reduce the number of times that MySQL needs to execute a query, resulting in faster and more efficient queries. If you haven't already, give CTEs a try in your next project, and see how much simpler your queries can become!

What are common table expressions (CTEs)?

At their core, a Common Table Expression is a SQL statement that can be referenced within the context of a larger query. CTEs are supported in MySQL 8. CTEs allow queries to be broken down into smaller parts that can be more easily comprehended by us mere humans. By doing so, it becomes simpler to reason about and compose complex queries.

How to use CTEs in MySQL

In MySQL, CTEs can be created using the WITH keyword. For example:

with cte_name as (
    select ...
)
select ...
from cte_name

The WITH keyword is followed by the name of the CTE and the query that generates it in parentheses. After defining the CTE, we can reference it in another query. Here's an example that uses CTEs to find customers who have spent more than the average on purchases at a particular store:

with spend_last_6 as (
  select
    customer_id,
    sum(amount) as total_spend
  from
    payment
    inner join customer on customer.id = payment.customer_id
  where
    store_id = 1
    and payment_date > CURRENT_DATE - INTERVAL 6 MONTH
  group by
    customer_id
)

-- Select everything from the CTE...
select * from spend_last_6

-- ...or select only the customers who spent more than the average:
select * from spend_last_6
where total_spend > (select avg(total_spend) from spend_last_6)

This query uses CTEs to define a temporary table called spend_last_6 which encapsulates all the logic required to calculate the amount a customer has spent in the last six months at store number one. Then we can select from it as if it were a normal table!

Recursive CTEs

Recursive CTEs can be used to generate data or work with data that you already have. They are a powerful tool to fill gaps in missing data or generate hierarchical data structures.

Recursive CTEs in MySQL

In SQL, Common Table Expressions (CTEs) are often used to combine multiple SQL queries into a single result set. However, there's a lesser-known type of CTE called a Recursive CTE, which can be used for generating data and manipulating hierarchical data structures.

To get started with Recursive CTEs, it's important first to understand the basics of CTEs themselves. A CTE is essentially a temporary result set that can be referred to within a SQL statement. CTEs allow you to break complex queries into more manageable, modular components.

Recursive CTEs, on the other hand, refer to themselves repeatedly to build up data. This simple fact enables Recursive CTEs to perform tasks that would be impossible with regular CTEs.

Building a sample recursive CTE

Here's an example of how to use a recursive CTE to build a simple list of numbers in SQL. We'll define a recursive CTE called "numbers" that generates a sequence of numbers from 1 to 10.

WITH RECURSIVE numbers AS (
  SELECT 1 AS n -- Initial Condition
  UNION ALL
  SELECT n + 1 FROM numbers WHERE n < 10 -- Recursive Condition
)

SELECT * FROM numbers;

Here's what's happening in this code:

  • We define a CTE called numbers using the WITH keyword and specify the RECURSIVE modifier.

  • We define our initial condition, which selects the number 1 and assigns it the alias n.

  • We define our recursive condition, which selects the value of n + 1 from the numbers CTE itself and stops once n reaches 10. This incrementally generates the sequence of numbers from 1 to 10.

If we run this code, we get a list of numbers from 1 to 10:

 n
---
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
(10 rows)

Working with dates in recursive CTEs

Recursive CTEs are not limited to generating lists of numbers. They can also work with dates and timestamps by generating sequences of dates.

Here's an example of how to use a Recursive CTE to generate a sequence of dates throughout the year 2023:

WITH RECURSIVE all_dates AS (
  SELECT '2023-01-01' AS dt -- Initial Condition
  UNION ALL
  SELECT dt + INTERVAL 1 DAY FROM all_dates WHERE dt < '2023-12-31' -- Recursive Condition
)
SELECT * FROM all_dates;

If we run this code, we get a sequence of dates from January 1, 2023, to December 31, 2023:

    dt
------------
 2023-01-01
 2023-01-02
 2023-01-03
 2023-01-04
 2023-01-05
 ...

Using Recursive CTEs, we can generate date sequences and use them to fill in gaps in our data. Let's see how we can use this generated sequence to fill in dates with missing data.

Filling in gaps in data

One of the benefits of Recursive CTEs is that they can be used to fill in gaps in data. Let's say we have a table of payments with missing data on certain dates:

payment_date | amount
-------------+--------
 2023-01-02  |  23.51
 2023-01-03  | 150.23
 2023-01-04  | 120.00
 ...

We can fill in these missing dates using a Recursive CTE. Here's an example of how it might look:

WITH RECURSIVE all_dates AS (
  SELECT '2023-01-01' AS dt -- Initial Condition
  UNION ALL
  SELECT dt + INTERVAL 1 DAY FROM all_dates WHERE dt < '2023-12-31' -- Recursive Condition
)

SELECT
  dt,
  SUM(amount)
FROM
  all_dates
  LEFT JOIN payments ON all_dates.dt = payments.payment_date
GROUP BY
  dt;

Here's what's happening in this code:

  • We define a CTE called "all_dates" using the WITH keyword and specify the RECURSIVE modifier. This generates a sequence of dates throughout the year 2023.

  • We join the "all_dates" CTE to the "payments" table on their date columns.

  • We select each date from "all_dates" and the sum of the matching amounts from "payments," grouped by date.

Now, we have a complete list of dates with the total amount for each date, even those without payment records:

    date     | amount
-------------+--------
 2023-01-01  |   NULL
 2023-01-02  |  23.51
 2023-01-03  | 150.23
 2023-01-04  | 120.00

Unions

A UNION query is used to combine the results of two or more SELECT statements into a single result set. Specifically, it takes the values from one table or query, and instead of putting them side-by-side with another table or another query, it puts them together over-under.

SELECT 1
UNION
SELECT 2;

This will output the result set of 1 and 2, as if they were in a single column.

| 1   |
| --- |
| 1   |
| 2   |

Notice how we used the UNION keyword to combine the two results. Also, bear in mind that the number of columns in all SELECT statements must be the same, with compatible data types, or else you'll get an error.

Looking at the above query, you'll see that this could be achieved with a simple SELECT statement. UNIONs are best used when we need to combine two queries that would otherwise be difficult to create in one query.

For example, let's say you have two tables, customers and staff, both with similar column names. You want to combine the results of both tables into a single result set, to produce a mailing list.

  SELECT
    first_name,
    last_name,
    email_address
  FROM customers

  UNION ALL

  SELECT
    first_name,
    last_name,
    email_address
  FROM staff;

Notice that we use UNION ALL instead of UNION to prevent MySQL from eliminating duplicate rows, which can be computationally expensive when we have a large result set. Instead, we simply combine the result sets, knowing that duplicates may exist.

Where to use?

This results in a single list of all staff and customers, with a shared data structure. Besides mailing lists, other applications for unions include combining archived data with active data, or comparing data in different tables that share a similar structure.

  select * from customer
  union all
  select * from customer_archived

Window functions

Window functions help you do aggregations, rankings, distributions, and more without collapsing the entire result set down to a single row. Instead, you can put all of that information on every row. In this guide, we'll explore what window functions are, how to use them, and some real-world scenarios in which they're useful.

The primary advantage of using window functions is that they enable you to compute values on a per-row basis while summarizing data across groups. You can partition your result set by one or more columns, and then perform calculations within each partition. With window functions, you can calculate running totals, cumulative sums, ranks, percentiles, and more.

How to use window functions in MySQL

Let's start with a simple query against the payment table, with no window functions yet:

SELECT
  customer_id, rental_id, amount, payment_date
FROM
  payment
ORDER BY
  customer_id, payment_date ASC;

This query returns all rentals sorted by customer ID and payment date. But what if we want to add a row number to each rental? We can use the ROW_NUMBER() window function to add a sequential number to each rental within the same partition:

SELECT
  customer_id, rental_id, amount, payment_date,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY payment_date ASC) AS num
FROM
  payment
ORDER BY customer_id, payment_date ASC;

We can also define named windows and reuse them across several window functions in the same query:

SELECT
  customer_id, rental_id, amount, payment_date,
  ROW_NUMBER() OVER(w_asc) AS num,
  FIRST_VALUE(rental_id) OVER(w_asc) AS first_rental,
  FIRST_VALUE(rental_id) OVER(w_desc) AS last_rental,
  SUM(amount) OVER(w_asc) AS running_total,
  LAG(rental_id) OVER(w_asc) AS prev_rental,
  LEAD(rental_id) OVER(w_asc) AS next_rental
FROM
  payment
WINDOW
  w_asc AS (PARTITION BY customer_id ORDER BY payment_date ASC),
  w_desc AS (PARTITION BY customer_id ORDER BY payment_date DESC)
ORDER BY customer_id, payment_date ASC;

By including the OVER() clause and specifying the partition and sort order, we can add a sequence number to each rental.

Consider performance when working with large result sets

While window functions can be powerful, they can also be resource-intensive, especially when working with large result sets. Consider the performance implications of using window functions before including them in your queries.

Sorting and limiting

Sorting your rows is not free. It can take a significant amount of resources and time to sort large data sets, and thus, it's important to do it with caution. If you don't need your rows in a certain order, don't order them.

Limiting results

To limit your results to 10 rows, for example, you can use the following query:

SELECT * FROM people LIMIT 10;

Sorting results

To order your results in MySQL, you can use the ORDER BY clause. By default, MySQL sorts in ascending order. Here's an example:

SELECT id, birthday FROM people ORDER BY birthday;

To sort in descending order, add the DESC keyword:

SELECT id, birthday FROM people ORDER BY birthday DESC;

It's important to note that if you're sorting by a column that is not specific enough to be deterministic, MySQL gets to decide what order to return the rows in. For example, if you sort by birthday and two people have the same birthday, MySQL gets to decide the order in which they appear.

To make the sorting deterministic, you should add more columns to the ORDER BY clause. For example, you can add the id column, which is guaranteed to be unique:

SELECT id, birthday FROM people ORDER BY birthday, id;

This query will sort the rows by birthday first and then by ID, making the sorting deterministic.

Offset your results

To offset your results in MySQL, you can use the OFFSET clause. Let's say you want to skip the first 20 rows and return the next 100 rows. You can use the following query:

SELECT id, birthday FROM people ORDER BY birthday, id LIMIT 100 OFFSET 20;

It's important to note that when you use the OFFSET clause, the sorted result must be produced first. Whether it's done manually or using an index, the sorted result must be produced before the offset is applied. In the worst case scenario, the entire list must be produced, sorted, and then the offset is applied, which can be inefficient.

Sorting with indexes

By adding indexes to our tables, we can reduce the need for secondary sorting operations and improve query performance.

We will focus on the people table, which contains 500,000 records, and has no indexes except for the primary key. This is sufficient for this demo, but not ideal for real-world situations.

Basic order by operation

We will begin by ordering the results of the people table by birthday and using a limit of 10.

SELECT
  *
FROM
  people
ORDER BY
  birthday ASC
LIMIT
  10;

When we run this query and examine it using the EXPLAIN statement, we see "Using filesort" in the Extra column. This means that MySQL produced a sorted result by sorting all the rows itself, not by reading an index in order.

| id | select_type | table  | type | possible_keys | key | key_len | ref | rows   | filtered | Extra          |
|----|-------------|--------|------|---------------|-----|---------|-----|--------|----------|----------------|
|  1 | SIMPLE      | people | ALL  |               |     |         |     | 491583 |   100.00 | Using filesort |

Indexes and ordering

To avoid the costly sorting operation, we need to add an index to the birthday column of the people table.

ALTER TABLE people
ADD INDEX birthday (birthday);

| id | select_type | table  | type  | possible_keys | key      | key_len | ref | rows | filtered | Extra |
|----|-------------|--------|-------|---------------|----------|---------|-----|------|----------|-------|
|  1 | SIMPLE      | people | index |               | birthday | 3       |     |   10 |   100.00 |       |

This time, MySQL is able to use the birthday index to sort the results, and the "Using filesort" note is gone.

Deterministic sorting

In order to ensure that our rows are returned in the same order every time, our sorting must be "deterministic." In many cases, this involves sorting by a unique identifier or primary key.

For example, if we add the ID column to our original query and sort by ID after sorting by birthday, we can ensure deterministic sorting.

SELECT
  *
FROM
  people
ORDER BY
  birthday ASC, ID ASC
LIMIT
  10;

In this case, MySQL can still use the birthday index, even though there is another column involved in the sorting process.

Backward index scans

In MySQL 8.0 or later, we can perform backward index scans when sorting in descending order.

For example, if we set the birthday index to order in descending order and rerun the query, we can see that MySQL performs a backward index scan. This is a new optimization that MySQL uses when it is able to read an index in reverse order.

alter table people add index (birthday desc);

Sorting with composite indexes

Sorting with multi-column indexes can be a powerful tool for improving database performance. It's important to follow the established rules on index creation and access patterns. By unlocking key parts and forming left prefixes, we can avoid file sorting and make use of all key parts in our index. Additionally, we must ensure that the sorting direction in our query matches the ordering of our key parts. With these techniques, we can take full advantage of multi-column indexes in MySQL.

Creating a composite index

We'll create a composite index over three columns: first_name, last_name, and birthday.

ALTER TABLE people ADD INDEX composite_idx(first_name, last_name, birthday);

Note that the order of columns in the index matters. We cannot skip columns or read them out of order when creating or using the index.

Sorting with multi-column indexes

Now that we have our multi-column index set up, let's explore sorting with it. We'll use the SELECT statement and order by first name, then limit the output to the first ten entries:

SELECT
 *
FROM
  people
ORDER BY
  first_name
LIMIT 10;

We should see that the query executes without any file sorting. This is because the multi-column index allows the database to efficiently sort and retrieve the requested data.

Unlocking key parts for sorting

Let's say we want to sort by last_name but skip over first_name. According to our rules, this is not allowed. However, we can unlock the last_name key part for sorting by adding an equality condition on the first_name column. Here's the query:

SELECT
  *
FROM
  people
WHERE
  first_name = 'John'
ORDER BY
  last_name;

The database can now use the unlocked last_name key part to sort the data. However, note that the birthday column is still locked behind the last_name key part and is not available for sorting or filtering.

Using all key parts for sorting

To use all three key parts of our index, we need to form a left prefix using the ordering in our query. We can do this by keeping the equality condition on first_name and ordering by the remaining key parts, last_name and then birthday. Here's our new query:

SELECT
  *
FROM
  people
WHERE
  first_name = 'John'
ORDER BY
  last_name, birthday;

Now we can sort by all three columns without file sorting. Once again, note that the order of columns in the index and query must match. Also, the id column is always appended to the end of the index but is subject to the same access rules.

Sorting direction and key parts

Finally, let's discuss sorting direction and how it relates to our multi-column indexes. We can sort each key part in ascending or descending order, but we must make sure that the ordering matches our query. If we sort one key part differently from our query, we may end up with file sorting.

Here are some examples:

  • If both key parts are sorted as ascending or descending, we get no file sort.
  • If one key part is sorted as ascending and the other as descending, we get a file sort.
  • If we swap the order of key parts in our index, we may get a backward index scan.

Remember that it's not the absolute direction of each key part that matters, but the relationship between the key part directions in the index and the sorting directions in your query. For example, we could define the index with mixed directions:

alter table people add index (first_name asc, last_name desc, birthday);

Counting results

MySQL allows you to count multiple things using expressions by taking advantage of the way the COUNT() function works. You can use the IF() or SUM() function to create conditions based on your requirements and calculate the sum of values that meet those conditions. This feature helps you quickly and efficiently count various values from a database all at once.

Before diving into counting multiple things using expressions, let's have a quick refresher on two basic counting techniques.

  • Counting Rows: To count the number of rows in a table, use the COUNT(*) function. This function counts all rows, regardless of whether they contain null values.
  • Counting Non-Null Values in Columns: To count the number of non-null values in a column, use the COUNT(column_name) function. This function counts only non-null values in the specified column.

Using expressions to count multiple things

Let's say we have a table called rental that contains rental information such as rental date, customer ID, and return date. We are interested in knowing how many rentals were done on weekends and how many were done on weekdays.

The DAYOFWEEK function

We can use the DAYOFWEEK(date) function to get the numeric value of the day of the week. This function returns a value between 1 (Sunday) and 7 (Saturday).

SELECT
  rental_date, DAYOFWEEK(rental_date)
FROM
  rental;

Running this query shows us the rental date and the corresponding day of the week number. This helps us identify the rental dates that fall on weekends.

Using the count and if functions

Now that we have identified the rental dates that fall on weekends, we can use the COUNT() function to count them. However, instead of passing * or a column name to COUNT(), we will pass an IF() expression, which lets us count only the rows that meet a specified condition.

In our case, we want to count the number of rentals that occurred on weekends. We can use the IF(condition, value_if_true, value_if_false) function to create a column that assigns a value of 1 for rental dates that fall on weekends and a value of null for those that do not.

SELECT
  COUNT(IF(DAYOFWEEK(rental_date) IN (1, 7), 1, NULL)) AS weekend_rentals,
  COUNT(IF(DAYOFWEEK(rental_date) NOT IN (1, 7), 1, NULL)) AS weekday_rentals,
  COUNT(return_date) AS completed_rentals,
  COUNT(*) AS total_rentals
FROM
  rental;

Running this query shows us the number of rentals that occurred on weekends and weekdays, as well as the total number of completed rentals and the total number of rentals.

Another approach — using the sum and if functions

While the previous approach is effective, it requires nesting the IF() function inside COUNT(). An alternative approach is to use the SUM() function instead, summing a boolean expression that evaluates to 1 when the condition is met and 0 when it is not.

In our case, we can create a new column for rentals that occurred on weekends and calculate the sum of this column using the SUM() function.

SELECT
  SUM(DAYOFWEEK(rental_date) IN (1, 7)) AS weekend_rentals,
  SUM(DAYOFWEEK(rental_date) NOT IN (1, 7)) AS weekday_rentals,
  COUNT(return_date) AS completed_rentals,
  COUNT(*) AS total_rentals
FROM
  rental;

This query produces the same results as the previous one but uses the SUM() function instead of COUNT(): each condition evaluates to 1 or 0, and summing those values gives us the count.

Dealing with NULLs

Null values are a reality of working with databases, but they can often cause headaches if not handled properly. By using the null-safe equal operator, the IS NULL and IS NOT NULL operators, and the IFNULL() or COALESCE() functions, we can effectively deal with null values in our SQL queries. With these tools in our toolkit, we can ensure that our queries are working as expected and that we're getting accurate and complete results.

Comparing null Values

When dealing with null values in SQL, one of the first things to consider is how to compare null values. In MySQL, null is not equal to anything, including itself. For example, if we run the query select null = null we’ll get null back as the result.

Similarly, if we use the equals operator to compare a column with a null value, we'll also get null back as the result. For example, if we run the query select null = 1 we'll get null back as the result.

This behavior can be frustrating, especially when trying to filter or compare data.

To account for null values, we can use the null-safe equal operator <=>, also known as the “spaceship operator.” This operator considers null and another null value as equal.
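
For example:

SELECT NULL = NULL;   -- NULL
SELECT NULL <=> NULL; -- 1 (true)
SELECT 1 <=> NULL;    -- 0 (false)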

To compare a column with a null value, we can use the is null or is not null operators. These operators will return true or false, depending on whether the value is null or not null.
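
A few sketches, using a hypothetical nullable notes column on the people table:

-- Filter on null / not null
SELECT * FROM people WHERE notes IS NULL;
SELECT * FROM people WHERE notes IS NOT NULL;

-- Provide a fallback value for nulls
SELECT IFNULL(notes, 'n/a') FROM people;
SELECT COALESCE(notes, email, 'n/a') FROM people;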