This is a series of blogs for advanced topics in SQL. You will need some basic understanding of Database and Management Systems and SQL. I hope you are ready for some in-depth “Indexing”.
What is Indexing?
Whenever we run a query on our database, we scan through the entire database, given a few conditions. Indexing is a method of optimisation that reduces disk access for particular types of query.
Note: Primary key is not the index. You can read more about the Primary key here.
An index is a type of data structure that is used to locate and access a particular data point.
To understand better, let’s assume a Table T as shown below
We need to define that, on which column or set of column do we want to perform the indexing. Let’s assume the following scenarios
Index on T.A
In that case, if a query looks for a ‘cow’, i.e., T.A = ‘cow’, this table won’t be completely scanned — rather an output can be immediately given, i.e., row number 3 and 5.
Now, just to understand better — if T.A = ‘cat’, the output will be the rows 1, 4 and 6.
Index on T.B
In that case, if a query looks for a ‘cow’, i.e., T.B = ‘2’, this table won’t be completely scanned — rather an output can be immediately given, i.e., row 0.
What if we want some more generalized query — say T.B > 6. In this case, the output will naturally be — rows 3, 4 and 6.
Index on T.(A, B)
Yes, it is possible to form an index on tuples as well. But what does it really mean? Consider, there is a scenario where you want a cat with a value greater than 5, i.e., T.A = ‘cat’ and T.B > ‘5’. So, the general way(without indexing) is to execute the 1st condition and then perform the 2nd operation post the execution of the 1st on the results of the 1st condition.
Hence the output would be rows — 4 and 6.
Why do we need Indexing?
There are always scenarios that we need very frequently. So let’s assume a very relatable scenario — We all are trying for Covid vaccines. So, what exactly happens, when you go to the vaccine booth(Given you get a slot :P). The person asks you for your mobile number/Unique Identification number to fetch your personal details (if you are going for the 2nd dose and enters the data if 1st).
Now imagine the number of entries in their database. Imagine the number of total slots existing in your country, all making request and DB serving all those requests.
When entries are in millions — just like the one scenario mentioned, imagine the database scanning the DB again and again, for all the searching requests. How many disk access will be used. What if we didn’t need to scan the complete database?
Indexing helps in avoiding the full database scan for such scenarios.
Now, what if we build indexing on Name? Would that help? No. We build indexes on the frequently used columns/parameters.
Always build indexes on frequently used columns.
How does Indexing happen?
An index is nothing but a data structure. So what exactly is it? Well, we have two options:
- Balanced Trees
- Hash Tables
Both of them come with their own benefits and drawbacks. Understanding the implementation difference is extremely important. We all know, that time complexity for Balanced trees is logarithmic and for Hash tables is a constant.
So that means, Hash tables will optimise your queries more than balanced trees. So why don’t we always use it?
Well, we saw a query — T.B > 6. Queries like this, cannot make use of hash trees. Hence,
For equality queries, we prefer Hash Tables whereas for inequality scenarios we prefer Balanced Trees.
So, in the previous example — when we index on T.A, Index table can look like
Here the column value holds the row number of the database entry of table T defined above.
So, when we ask for cat, cow or dog — it just has to return the value. This is how it works. Now if cat, dog and cow had say 1000000 entries each, we won’t have to worry about the other two.
Enough theory, let’s code. So, that’s pretty easy in SQL. I guess the smallest section. Let’s assume we have a usual Student Database as shown below.
The most common query would be getting the details of a student from their roll number, hence making an Index on the sID would make sense.
The above query would be optimised, however large the data is. Now given the below SQL query, assuming that it might be a frequently used query —
What should be the index based on?
Well obviously, it would make sense to index on sName or GPA. But, to optimise further, indexing on the tuple (sName, GPA) would be the best option.
How do you create indexes?
It’s as simple as anything —
Note: The syntax is based on MySQL.
If you want to read more about it — you can find it here.
Downside of Indexes
Everything comes at its cost.
Indexes require extra space to maintain.
This is again a high operation cost. Not to forget that deciding the column/parameter to tuples that should be indexed is another issue. Wrong indexing would just take some space and doesn’t give you any optimisation.
Let’s give it a thought, the databases might change very frequently. CRUD operations are performed at a very high rate on high traffic DB’s. So,
- Do you perform indexes on every change?
- Do you batch your changes?
Or go for a 3rd option. So, sometimes, indexing can offset the benefits because of maintenance.
People often use Physical Design Advisor to make decisions on Indexing.
Input: Database(statistics) and Workload
Output: Recommended Index
The way it works is, you feed Database Statistics, Queries or Updates you want to make and the Indexes to a Query Optimiser and it gives out the best execution plan with an estimated cost. It tries the updates with different indexes to see who has the minimum cost.
Thanks for reading till here. Next, I will be discussing Transactions. Stay tuned!