Advanced Topics in SQL — I | Indexing

This is a series of blogs for advanced topics in SQL. You will need some basic understanding of Database and Management Systems and SQL. I hope you are ready for some in-depth “Indexing”.

Designed by Freepik

What is Indexing?

Note: Primary key is not the index. You can read more about the Primary key here.


To understand better, let’s assume a Table T as shown below

Table T

We need to define that, on which column or set of column do we want to perform the indexing. Let’s assume the following scenarios

Index on T.A

Now, just to understand better — if T.A = ‘cat’, the output will be the rows 1, 4 and 6.

Index on T.B

What if we want some more generalized query — say T.B > 6. In this case, the output will naturally be — rows 3, 4 and 6.

Index on T.(A, B)

Hence the output would be rows — 4 and 6.

Credits: Giphy

Why do we need Indexing?

Now imagine the number of entries in their database. Imagine the number of total slots existing in your country, all making request and DB serving all those requests.

When entries are in millions — just like the one scenario mentioned, imagine the database scanning the DB again and again, for all the searching requests. How many disk access will be used. What if we didn’t need to scan the complete database?

Indexing helps in avoiding the full database scan for such scenarios.

Now, what if we build indexing on Name? Would that help? No. We build indexes on the frequently used columns/parameters.

Always build indexes on frequently used columns.

How does Indexing happen?

  1. Balanced Trees
  2. Hash Tables

Both of them come with their own benefits and drawbacks. Understanding the implementation difference is extremely important. We all know, that time complexity for Balanced trees is logarithmic and for Hash tables is a constant.

So that means, Hash tables will optimise your queries more than balanced trees. So why don’t we always use it?

Well, we saw a query — T.B > 6. Queries like this, cannot make use of hash trees. Hence,

For equality queries, we prefer Hash Tables whereas for inequality scenarios we prefer Balanced Trees.

If you want to read more about Hash Tables and Balanced trees, you can read them here and here respectively.

So, in the previous example — when we index on T.A, Index table can look like

Here the column value holds the row number of the database entry of table T defined above.
So, when we ask for cat, cow or dog — it just has to return the value. This is how it works. Now if cat, dog and cow had say 1000000 entries each, we won’t have to worry about the other two.

Let’s Code

The most common query would be getting the details of a student from their roll number, hence making an Index on the sID would make sense.

The above query would be optimised, however large the data is. Now given the below SQL query, assuming that it might be a frequently used query —

What should be the index based on?

  1. sID
  2. sNAME
  3. contactNumber
  4. GPA

Well obviously, it would make sense to index on sName or GPA. But, to optimise further, indexing on the tuple (sName, GPA) would be the best option.

How do you create indexes?

Note: The syntax is based on MySQL.

If you want to read more about it — you can find it here.

Downside of Indexes

Everything comes at its cost.

Extra Space

Index Creation

Index Maintenance

  1. Do you perform indexes on every change?
  2. Do you batch your changes?

Or go for a 3rd option. So, sometimes, indexing can offset the benefits because of maintenance.

Bonus Section

Input: Database(statistics) and Workload
Output: Recommended Index

The way it works is, you feed Database Statistics, Queries or Updates you want to make and the Indexes to a Query Optimiser and it gives out the best execution plan with an estimated cost. It tries the updates with different indexes to see who has the minimum cost.

Thanks for reading till here. Next, I will be discussing Transactions. Stay tuned!

Entrepreneur | SE at Microsoft | Ex-DSC Lead | Winner of SIH2019 | Android Developer