Lesson 22-23: SQL
SQL stands for "Structured Query Language." SQL is an industry standard, that is, it is not something invented by SAS, but is used by all major software companies that deal with databases. Although each company has its own "enhancements," the basic language is the same for all of them. Thus, what you learn here will be applicable in Microsoft Access, Oracle, Sybase, and many others.
As we have seen in an earlier lesson, in database terms, a data set is called a "table," an observation is called a "record," and a variable is called a "field." Most of the work in databases is done by means of "queries." Query, of course, is a word related to "enquire" and "question." A query, more or less, is a question asked of a database, for which the answer is usually some portion of one or more tables, or a summary of the data in the tables.
SQL is used in SAS by invoking Proc SQL, then submitting queries. Proc SQL is interactive, like proc plot and proc reg, so it will continue to accept queries after the initial statements are submitted. In fact, proc SQL does not require run statements to work. Each query is executed immediately upon submission. Proc SQL continues running until another proc step or data step is encountered, or a quit statement is submitted.
We will once again refer to the used cars data set for some examples. Here is a simple query to start with.
A basic query begins with the keyword select. The idea is that we are going to "select" something from the database that we want to display. The results of the select statement are displayed in the output window. A select statement will have, at least, a list of variables to display, and a source table, given in the from clause. Unlike most lists in native SAS language, items in SQL lists are separated by commas. This is one of the most common sources of mistakes for beginning programmers. SQL is intended to be fairly "plain English" in nature, so we may be tempted to put commas between clauses, as in English, but SQL only uses commas for list separators. The next example shows how to form a list of variables, as well as how to use a where clause to restrict the records that are selected.
There is a shortcut if you want to get all the fields from a table:
In the next example, we show that new fields can be created by using functions (or any mathematical expressions). The count function counts the number of times a value occurs. The "group by" clause is added to the end so that the counts, as well as the sums, apply to each make. The effect is similar to the by statement we have seen in other procs. The keyword "as" is used to assign a name to the new field. This syntax can also be used to assign aliases for existing field names. Also included is a format for the price. Note that the result includes one row for each value of make, the group by field. Be careful not to select any other field which has differing values within one make, because that will force the output to give multiple rows for each make. As long as the results are unique within the group by values, you will get one row for each. Also, if you leave out the group by clause, you will get one row for each row in the table, and all the "number" and "totprice" values will be the same, the grand total for the whole table.
You can also add an "order by" clause to sort the results differently. The example below will sort the above output by totprice. (To reverse the order, put "desc" AFTER the sort field.) As queries get more complex, writing style becomes important to keep track of the various parts of the query. It is good to start each clause on a new line, and if the lists are long, put each item on a separate line, with appropriate indenting.
The above examples show that SQL provides useful and easy ways to extract both detail and summary information from data sets (tables). But some of the most powerful uses of SQL have to do with extracting information from multiple tables. There is some similarity between what we will do here and what can be done with a merge statement in a data step, but SQL does things differently and usually more efficiently. One major difference is that SQL never requires data to be sorted in advance.
In database systems, much attention is given to a process called "normalization." This essentially means that data are split up across multiple tables in order to avoid redundancy. For example, in a sales database, you might have one table that lists customers with their contact information, another table that lists the salesmen's information, and a third table that gives individual sales, including one field to link to the customer table and another field to link to the salesmen table. Then a query might be written to extract a particular sale, including the customer information and the salesman's information for that sale, getting some of the information from each of the three tables. To demonstrate how this works, we will split up our usedcars data into two tables. In this data, all of the model names are unique to one make, so model will be the linking field. We can use SQL to create these new tables, with, can you guess? "create table" queries. Here's one. (The create table query doesn't send anything to the output window.)
And here's the other.
So now we suppose that these are the two tables we have to get our data from, and we want to print a report with the make, model, year, price, and miles. In the list of fields selected, we now use two-level names to specify which table the variables come from (actually this is only necessary for those in more than one table). The asterisk can still be used to request all the variables from one table. In the from clause, we list all the tables being used. In the where clause, we "join" the tables by specifying which fields to match up in the two tables.
The where clause can contain the equality that defines how the tables are joined as well as other conditions or restrictions, combined by using logical expressions. More than two tables can be joined, and they need not be joined by the same field. Again, logical expressions can be used to combine these conditions. In case there are values that have no match, no rows are included for them. This differs from the merge statement in a data step, where missing values are generated in such a case. However, SQL can produce the same effect, with something called an "outer join," if needed.
Use SQL and the School Sales
data as the
basis for the following exercises.
Do not create any new tables except for #7.
Also, there should be no other procedures used, nor any data steps beyond those that are in the program schoolsalesdata.sas which is linked above..
1. Display the complete contents of the "sales" table, sorting by name.
2. Display the name and grade of students in the 11th and 12th grades only, sorted by grade from highest to lowest.
3. Display each student's name and her goal.
4. For each student, display name, grade, and total sales.
5. Run the
following five queries. In each case, explain what SQL is counting, and why
the results are different (or similar).
a) select count(grade) as numgrade from students;
b) select count(name) as numname from students;
c) select count(distinct grade) as numgrade from students;
d) select count(grade) as numgrade from students group by grade;
e) select grade, count(grade) as numgrade from students group by grade;
6. For each grade, display grade, goal, number of students, and total goal (students*goal).
7. Create a new table that contains the name and sales (only
those two fields) of students in the 12th grade. Use a select statement to
display the results.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.