Part 1 — What is Indexing? Why it's important to know?

Rehmat Sayany
Westwing Tech Blog
Published in
6 min readJun 23, 2022

--

Working at Westwing, I once received a ticket about a cron that ran all night and never finished. I started looking into the queries that were running and found a problematic query. Here’s what I discovered: the query couldn’t pick up relevant indexes, even though the table contains indexes on that column. This resulted in data read from disk and query fetching being extremely slow. To share my experience, I decided to write a series of blog posts to discuss the concept behind database indexing and how to find out if your query needs any optimization or not.

So … Let's start with what an index is.

An index is an ordered representation of indexed data.

The index makes query retrieval fast. “So if that’s the case let’s index all the columns in the table” one might think — this would not be the correct approach, because creating an index on all of the columns will create the following bad results:

  1. Using Indexes improves read time but it makes write time slower.
  2. With every insert/delete we have to insert/delete indexes and then re-balance out BTREE which will take some time.
  3. Updates only become slower if an indexed column is updated. It might be a good idea to avoid indexes on columns that are very frequently updated.
  4. If more than one index is created for a table, the indexes might need more space than the table itself.
  5. If there are many indexes that can be used for a given query, query optimization may take longer.

So the rule of thumb should be to have as many indexes as necessary but as few as possible.

So what is a B-TREE?

B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.

B-Tree Representation

The left sub-tree is always smaller than the right sub-tree. The nodes on the top are called Root Nodes, the bottom one is called Leaf nodes and in the middle are called intermediate nodes. The leaf nodes don't have any further subtree. This way it's ensured that every search will take the same number of steps to find the value. The leaf nodes also have doubly linked lists, which allows moving forwards and backward within leaf nodes.

How is indexing used in a database?

When it comes to database indexing, the B-tree data structure gets a little complicated by not just having a key, but also having a value associated with the key. This value is a reference to the actual data record. The key and value together are called a payload.

Let's take an example of the table called Employee where we have only two columns: Id, Name. We have created the index on the Name column only.

Employee Table
Employee Table
B-Tree Representation of Index column Name

The Employee ID is nowhere to be found on this index, you can only see the Name value. In simpler words, The index only contains the value of the index column. Some people think that the index contains all data of the other columns too which is totally incorrect.

So now the question is how would I get the data which is not in the index. Let’s say, for example, that I need the age, department, and employee Id of the Kamil. This is the reason why the Database saves one more attribute which is ROWID. It's not a primary key of our table. It's a Database internal identifier for every row and that way we can go back to the table ( which will be read from the disk ) and find the corresponding row. One thing is to be remembered here is that the index data is in the memory and the non-index data will be on the disk.

How can we minimize reading from disk?

So let's suppose we have a query as shown below and we have a multi-column index on CUSTOMER(LAST_NAME, CITY)

SELECT DISTINCT CITY FROM CUSTOMERS WHERE LAST_NAME = ’Smit'

So in this case there will be no read from disk as both last_name and city data are already present on the index. Another example is when you have an index on CUSTOMERS(ZIP) then the below query will be very fast because it suffices to count the ROWIDs from the index.

SELECT COUNT(*) FROM CUSTOMERS WHERE ZIP = 15260

What are the benefits of using this technique?

  1. In this way, the search will be very fast. If we want data on employee Kamil then we will go directly from Root node to leaf Node and will narrow down our searches because basically we are either searching on the left or right sub-tree and following binary search.
  2. For each query, it will take the approximately same amount of time to get the result.
  3. Either we will find the result or we will get NULL value.

Now let's move forward and see how to find the execution plan of queries. In this way, we can find why some queries are working slower.

The EXPLAIN keyword is used throughout various SQL databases and provides information about how your SQL database executes a query. In MySQL, EXPLAIN can be used in front of a query beginning with SELECT, INSERT, DELETE, REPLACE, and UPDATE. For a simple query, it would look like the following:

EXPLAIN SELECT * FROM sakila.users where id=1

So when you run the above query, the result will look like in the image below.

Execution plan

Let's understand the type column here. If you will look at MySQL Documentation it is defined as a join type. I think join type is not the best name for it, so in the remainder of the article let’s refer to it as access_type.

Now let's find out what are some different values for this access_type Column.

  1. Const/eq_ref
  2. ref/range
  3. index
  4. ALL

Const/eq_ref

If you can guarantee that your query will return a single value that will be unique. There are two ways you can do this. One is if the index is the primary key and another way is that unique constraint on the column.

SELECT * FROM tbl_name WHERE primary_key=1; 

if you get const/eq_ref in your query stop optimizing. It will not get any faster.

Ref/Range

Is used in cases when you can guarantee only rows that are in a given range are retrieved, using an index to select the row. So let's take an example, let's say you have a query where the id is greater than 5 and less than 10. So it will start B-Tree traversal and come to a leaf node where it gets a leaf node with id 6 and then using the doubly linked list it will traverse the leaf nodes and find a range of values from 6 to 9. So it does not have to go back to the Root node. This will save a lot of time.

SELECT * FROM tbl_name   WHERE key_column = 10; SELECT * FROM tbl_name   WHERE key_column BETWEEN 10 and 20;SELECT * FROM tbl_name   WHERE key_column IN (10,20,30);

SELECT * FROM tbl_name WHERE key_part1 = 10 AND key_part2 IN (10,20,30);

Index

It is also known as a full Index scan because it will scan from first to last leaf node and we are using the index but not using it to limit the result.

ALL

A full table scan is done and we are loading all data from disk to memory and filtering them. If your execution plans say access type equals ALL, this means you are not using the index at all which is very bad. This type of query will take the longest time to execute.

The second important column in the execution plan is the rows column. The rows column indicates the number of rows MySQL believes it must examine to execute the query.

The possible_keys column indicates the indexes from which MySQL can choose to find the rows in this table

Let's stop it here and I will tell you in my next blog post how to make your query faster and change your access type from ALL to const/range using some query optimization techniques.

--

--

Full Stack developer @westwing passionate about NodeJS, TypeScript, React JS and AWS.