SQL 1: Intro to SQL and the SELECT Statement
SQL stands for Structured Query Language – see, you're halfway through the first sentence of this course, and you're already half asleep! Simply put, SQL is a way for us humans to communicate with databases. And "database" is a pretty generic term for "way for a computer to store data". There are lots of ways for a computer to store data, including (but certainly not limited to):
- CSV, JSON, or other structured text file
- Spreadsheet
- Document database
- Relational database management system (RDBMS)
That last one – relational databases or RDBMS – is the one we're going to concentrate on in this course. While there are many software applications that store data in other ways, including those listed, a vast majority of business-class software applications use some flavor of relational database to store data. These "flavors" are vast – some are open-source solutions, and some are proprietary and commercial. Examples of these are:
- MySQL (open-source)
- PostgreSQL (open-source)
- SQLite (open-source)
- IBM DB2 (closed-source, commercial)
- Microsoft SQL Server (closed-source, commercial)
- Oracle Database (closed-source, commercial)
And, although there are a large number of these flavors, developed by many different companies and organizations, the core of the SQL language is a standard that may be used to communicate with any of them.
Let's dive right in.
The Basics
Relational databases store data in tabular format. If you've ever used Microsoft Excel, Google Sheets, or any other spreadsheet application to store a list of anything, you've probably invented a way to organize data that is very similar to how databases do it – perhaps:
- Your data is stored in a grid format
- The columns of the grid represent the attributes of the thing you are storing
- There is a row in the grid for each one of the things, and maybe that's called a record
For example, if I were to build a spreadsheet of my favorite movies, it might look something like this:
# | Title | Year | Rating |
---|---|---|---|
1 | Star Wars | 1977 | PG |
2 | Apollo 13 | 2004 | PG |
3 | The Matrix | 2007 | R |
Pretty straightforward. And, if I want to add more attributes to my data – say, I want to add the studio that produced the movie – I'd simply add to the columns of data.
# | Title | Year | Rating | Studio |
---|---|---|---|---|
1 | Star Wars | 1977 | PG | Lucasfilm |
2 | Apollo 13 | 2004 | PG | Universal |
3 | The Matrix | 2007 | R | Warner Bros. |
And, if I want to add more movies, I'd add to the rows of data.
# | Title | Year | Rating | Studio |
---|---|---|---|---|
1 | Star Wars | 1977 | PG | Lucasfilm |
2 | Apollo 13 | 2004 | PG | Universal |
3 | The Matrix | 2007 | R | Warner Bros. |
4 | Forrest Gump | 1994 | PG-13 | Paramount |
As simple as this example is, this is precisely how data is structured in a relational database system. To offer up some definitions – the columns are known as the fields or attributes, the rows are known as records, and the rows+columns together is a table. When we have a collection of many tables, that's a database.
In practice, the main difference between a spreadsheet (e.g. Excel) and a database system is that Excel already includes a standard front-end by which an end-user (like you) is able to read and write to the data. Since the use-cases for RDBMSes are much more varied and programmable than a spreadsheet, there's no such standard for relational databases. Instead, we use SQL to talk with the database, and to pass the results on to our front-end of choice, whether that's a custom web application, a PDF report, a commercial front-end like Tableau, or whatever else might be needed.
Using SQL to read data
In this lesson, we'll be looking at ways we can use SQL to read data from a relational database. In a later lesson, we will be looking at using SQL to write to and modify a database.
The SELECT statement
We read data from a database by using the SELECT
statement in SQL. Let's look at an example. Suppose we own an e-commerce store – we'd probably have a a database table used to store the products we sell. This table would likely be named Products
and look something like this:
Name | Price | Department |
---|---|---|
Camera | $299 | Electronics |
Sofa | $699 | Furniture |
Dining Table | $1299 | Furniture |
Toaster | $79 | Housewares |
The SELECT
statement allows us to ask the database questions. What products do we sell? How much does the Toaster cost? How many products cost more than $500?
It would be nice if the SQL language knew how to answer your question, in exactly the way you asked it. But, no, it's a computer, so we have to follow a simple formula so that you can ask the question in a way that SQL can understand:
- Identify the table that holds the information you need
- Identify the columns that hold the information you need
- Add any other conditions
- Then, ask the question in SQL
For instance, if my question is: what are the names of the products we sell?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
- Add any other conditions – none
- Then, ask the question:
SELECT Name FROM Products;
There are 3 things about this SELECT
statement that I'll point out:
- It ends with a semicolon
;
– this depends on the flavor of RDBMS and the front-end being used, but many databases will require the semicolon at the end of the SQL statement, so all the examples we'll use will include it - The column and table names are case-sensitive; again, this will depend on the specific database system, but for illustration's sake, all examples will match the table/column name casing in the database
- The
SELECT
andFROM
clauses are in ALL CAPS; this is NOT a requirement in SQL, but clauses in examples will be written this way, as it is often more readable
The result that we'll receive is actually another, temporary table – i.e. one that is not part of the database structure or schema as-designed, but is simply a temporary subset of the larger table we've targeted. I like to think of this as a "👶baby table" – a much smaller table that looks a lot like the larger, original table, but only holds the data we asked for:
Name |
---|
Camera |
Sofa |
Dining Table |
Toaster |
To be clear, "👶 baby table" is not an official SQL term... it's just what I like to call it.
Now, it's up to you to decide if the results provided in this "👶 baby table" are everything you need, or if you need to modify your question in order to get more information.
Hmmm, it would nice if I could see the price too. Can I get a list of all the products we sell, along with the price?
Back to the formula:
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
- Add any other conditions – none
- Then, ask the question:
SELECT Name, Price FROM Products;
Name | Price |
---|---|
Camera | $299 |
Sofa | $699 |
Dining Table | $1299 |
Toaster | $79 |
Sorting
Now that we've seen how to use the SELECT
statement in SQL to retrieve rows of data from a database table, let's take one step toward doing something more meaningful. It's pretty rare that we'll want to get all records in a database table, in exactly the order that the records were inserted – this is what we've done thus far.
We'll often want our records in our 👶 baby table to be in a specific order – perhaps alphabetically or sorted by a numeric value. This is going to be based on how the data is ultimately consumed – e.g. in a PDF report for a stakeholder in your company, displayed on-screen for a user of a web application, and so on. Let's continue with our question and answer format for this one:
Can we order the results by price?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
- Add any other conditions – sort by
Price
- Then, ask the question:
SELECT Name, Price FROM Products ORDER BY Price;
Name | Price |
---|---|
Toaster | $79 |
Camera | $299 |
Sofa | $699 |
Dining Table | $1299 |
As we can see, ascending order (from the least to the greatest) is the default when sorting a numeric value. (It's alphabetically from A-Z with a non-numeric value.)
Can we order by the more expensive products first?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
- Add any other conditions – sort by
Price
in descending order - Then, ask the question:
SELECT Name, Price FROM Products ORDER BY Price DESC;
Name | Price |
---|---|
Dining Table | $1299 |
Sofa | $699 |
Camera | $299 |
Toaster | $79 |
Limiting
Another common use-case is that we want to limit the number of rows in our 👶baby table – what we've done so far is just fine for our very small table with 4 rows of data. But what if our Products
table contained 10 million rows? The SELECT
statements we've written thus far would be very slow, and not likely to be very helpful or meaningful to the person receiving the results. So we must often reduce our results, and that usually means refining the question we're asking.
What is the single most expensive product we sell?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
- Add any other conditions – sort by
Price
in descending order and only return 1 row - Then, ask the question:
SELECT Name, Price
FROM Products
ORDER BY Price DESC
LIMIT 1;
Same results as before – but only returns a single record, answering our question. Also note that we're separating each fragment of the SQL statement here – AKA clauses – by putting each on its own line. As our SQL becomes more complex, I've found this easier to read and comprehend – especially when first learning – so we'll do this from now on.
Lab 1: The Basics
Let's play ball! To begin, proceed to https://github.com/entr451-spring2024/sql-intro and open the project in Gitpod by clicking the Gitpod button. Then, in Gitpod, open a new Terminal session by going to the hamburger menu -> View -> Terminal. You can also try the Toggle Terminal keyboard shortcut. Slide the window containing the Terminal window so that it's larger, if you like, and begin by typing:
sqlite3 baseball.sqlite3
The database software that we'll be using in this lesson is SQLite, version 3. The sqlite3
command opens an interactive SQLite application where we can type SQL commands into the Terminal and, in this case, opens the baseball.sqlite3
database, which contains data about, well... baseball! Try it!
SELECT first_name, last_name FROM players;
Don't forget that semi-colon! You should see a list of all the first and last names of all the baseball players in our database. You can also run a few SQLite-specific commands (i.e. not SQL, but commands that are utilities for the SQLite3 software itself). These commands start with the dot character (.
), like this:
.schema
The output you see shows you the three tables contained within our database, and the structure of the tables (we'll get more into the structure of tables later). Again, these SQLite commands are not SQL – so they do not end with the semi-colon.
You can also run SQL queries from files. Open demo.sql
and you'll see the same query you just tried above. Run this query by issuing the following command from within SQLite:
.read demo.sql
This simply reads the SQL in the specified file, executes it, and displays the results on-screen.
You can also turn "column mode" and headers on, so that the results of your query will appear in a table-like format:
.mode column
.headers on
Browse the files named lab-1-*.sql
– each file contains a question to be answered. See if you can write the SELECT
statement necessary to answer each question, and execute each file using the .read
command, as outlined above. For example, to run the first lab, write your SELECT
statement in the lab-1-1.sql
file, then execute it from within SQLite:
.read lab-1-1.sql
The output produced by the SELECT
statement will be displayed on-screen. Note: by default, Gitpod will only display the last 1000 lines of output.
See the answers when you're done.
Filtering Data
The ability to filter data is probably the most powerful and often-used feature of SQL. For many people, gaining insights from arbitrary sets of data is often the objective of learning SQL in the first place.
We've already learned about the LIMIT
clause, which simply whittles down the size of our result set – the 👶 baby table – by some number that we provide. However, we often want to specify criteria for filtering the results, rather than a set number like 10
. What are our best-selling products? Which sales tactics are producing the best results this month? How many orders did we receive for the new widget this week?
For questions like these, we must use the WHERE
clause. The WHERE
clause allows us to add more criteria to the questions we ask. Let's jump into an example using our Products table:
Which products cost more than $500?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
- Add any other conditions –
Price
is more than $500 - Then, ask the question:
SELECT Name, Price FROM Products WHERE Price > 500;
Name | Price |
---|---|
Dining Table | $1299 |
Sofa | $699 |
A WHERE
clause is going to always have a column name (in this case, Price
), a comparison operator (>
or "greater than") and a value to compare (500
). Because the column in question contains numbers, it's going to compare the Price
of every row in the table against the numeric value 500
and filter the results to only include those with a price greater than that number. Let's look at what happens with a non-numeric comparison:
Which products are sold in the Furniture department?
- Identify the table that holds the information you need –
Products
- Identify the columns that hold the information you need –
Name
andPrice
again, but we probably wantDepartment
now too - Add any other conditions –
Department
is Furniture - Then, ask the question:
SELECT Name, Price, Department
FROM Products
WHERE Department = "Furniture";
Name | Price | Department |
---|---|---|
Sofa | $699 | Furniture |
Dining Table | $1299 | Furniture |
Again, we have a column name (in this case, Department
), a comparison operator (=
or "exactly equals") and a value to compare ("Furniture"
). Because Department
is non-numeric, the value to compare must be in quotes. This is necessary to distinguish the comparison value from columns name, other SQL clauses, variables, and the like.
Separate multiple filters/conditions with the AND
clause. So if we wanted the answer to:
Which products are sold in the furniture department and cost more than $500?
SELECT Name, Price, Department
FROM Products
WHERE Department = "Furniture"
AND Price > 500;
Lab 2: Filtering
See the files named lab-2-*.sql
for the questions, and modify each file until you achieve the desired result. See the answers when done.
A Star is Born
We've already seen how SQL queries require us to provide the columns that we want to be returned in the 👶 baby table – and how we often need to think about what columns those are when providing clean, concise output. Occasionally though, the columns don't really matter. In these cases, we can use the asterisk or star (*
) to denote that we want all columns in a table. Consider a Reviews
table, which contains reviews about our products. If we simply want all columns, all rows from this table, we could write a query that looks like:
SELECT * from reviews;
And the result would be everything – all columns and all rows – from that table.
Product | Rating | Body |
---|---|---|
Camera | 4 | Takes nice pictures! |
Sofa | 5 | Comfy! |
Camera | 5 | Best camera ever!!! |
Toaster | 3 | Makes toast. |
Product | Rating | Body |
---|---|---|
Camera | 4 | Takes nice pictures! |
Sofa | 5 | Comfy! |
Camera | 5 | Best camera ever!!! |
Toaster | 3 | Makes toast. |
Product | Rating | Body |
---|---|---|
Camera | 4 | Takes nice pictures! |
Sofa | 5 | Comfy! |
Camera | 5 | Best camera ever!!! |
Toaster | 3 | Makes toast. |
In practice, we're rarely going to use it like this, unless a table contains just a few columns, or when we don't care about concise output, like when we're testing our queries. Or, within an aggregate function.
Aggregate Functions
Often, the questions you want answered will involve math. That is, instead of simply seeing a subset of the rows in a table, we'll want to aggregate the results together in some way, e.g. adding the number of rows, summing a single column, seeing an average, etc.
How many reviews have been written?
- Identify the table that holds the information you need –
Reviews
- Identify the columns that hold the information you need – not applicable in this case, since we're looking for an aggregate number
- Add any other conditions – none
- Then, ask the question:
SELECT COUNT(*) FROM Reviews;
COUNT(*) |
---|
4 |
There are a couple of crazy things going on here! Let's break it down.
First, instead of a column name like Name, Price, Body, etc. in our SELECT
clause, we're calling a SQL function. There are many SQL functions built into SQL, but they will always contain a set of parentheses, which denote arguments to the function. In this case, we're using the COUNT
function, which accepts an argument of "what columns are we counting?" Since we're simply looking for the aggregate number of rows in the table, the columns aren't relevant and we just use *
.
And even though we're asking for an aggregate, that is, a single number representing the count of all the reviews, our result is still a 👶 baby table. It simply contains a single column – called COUNT(*)
– and a single row. We can call several SQL functions, or a combination of SQL functions and column names, which will result in a 👶 baby table containing several columns.
Let's look at another aggregate function, where the argument does matter:
What is the average rating across all reviews?
- Identify the table that holds the information you need –
Reviews
- Identify the columns that hold the information you need – an average
Rating
- Add any other conditions – none
- Then, ask the question:
SELECT AVG(Rating) FROM Reviews;
AVG(Rating) |
---|
4.25 |
Similar as before, with our COUNT
function, but this time, the argument matters. We're using the AVG
function, which takes a column name as an argument – which column are we averaging?
As mentioned, there are many functions built-in to SQL. The ones that are relevant to this lesson are aggregate functions – ones that do math to aggregate information. Examples of aggregate functions include:
COUNT
- counts the number of rowsAVG
- average of a column value or expressionSUM
- sum of a column value or expressionMIN
- minimum of a column value or expressionMAX
- maximum of a column value or expression
Grouping
Occasionally, a question like How many reviews have been written? will be enough to satisfy business requirements. However, it's more likely that your boss, client, or other stakeholder will want more specific information than simply aggregating the entire table's worth of data. We'll often want to categorize or aggregate (group) results by the value of another column. For example, a more common question might be:
How many reviews have been written per product?
- Identify the table that holds the information you need –
Reviews
- Identify the columns that hold the information you need – the
Product
and the aggregate number of reviews - Add any other conditions – none
- Then, ask the question:
SELECT Product, COUNT(*)
FROM Reviews
GROUP BY Product;
Product | COUNT(*) |
---|---|
Camera | 2 |
Sofa | 1 |
Toaster | 1 |
We see that the values separated by commas in the SELECT
clause, whether they are column names or SQL functions, end up being the columns in the 👶 baby table. And, because we used a GROUP BY
clause, our results are aggregated (grouped) by the distinct values in the Product
column.
This brings us to Rule #1A, the First Law of all SQL... when using a GROUP BY
clause, the columns listed in the GROUP BY
must also be in the SELECT
clause. Makes sense, right? In this example, if the Product
column wasn't also in the SELECT
, the results would just be a list of numbers with no context:
SELECT COUNT(*)
FROM Reviews
GROUP BY Product;
COUNT(*) |
---|
2 |
1 |
1 |
Not very useful, is it? And, just to be clear, it's perfectly fine to have columns in the SELECT
that are not in the GROUP BY
– just not the other way around.
Order of Clauses
So far, we've seen several different types of SQL clauses. In fact, we've seen enough to warrant talking about the order of these clauses. Because of the way SQL is executed, the order in which the clauses are written does matter. This may differ slightly based on the flavor/implementation of SQL being worked with (i.e. some are more forgiving than others), but generally speaking, the order should be as follows:
SELECT
- the columns or executed functions you want in the resultFROM
- the table(s) involvedWHERE
- the conditions or constraintsGROUP BY
- columns in theSELECT
used to aggregate the resultsORDER BY
- columns by which the results shall be ordered, either numerically or alphabetically
To put it all together:
What is the average rating per product, taking into account only those reviews with a rating of 3 or above, from lowest to highest average rating?
- Identify the table that holds the information you need –
Reviews
- Identify the columns that hold the information you need – the
Product
and the aggregate average rating of reviews - Add any other conditions – rating of 3 or above
- Then, ask the question:
SELECT Product, AVG(Rating)
FROM Reviews
WHERE Rating >= 3
GROUP BY Product
ORDER BY AVG(Rating)
Product | AVG(Rating) |
---|---|
Toaster | 3 |
Camera | 4.5 |
Sofa | 5 |
Lab 3: Aggregation
See the files named lab-3-*.sql
for the questions, and modify each file until you achieve the desired result. See the answers when done.
#BabyTable