Lesson 2: Data Sets and Libraries
SAS stores data in “SAS data sets,” using its own internal database format. If you are familiar with any database software (Oracle, Sybase, Dbase, MS Access, etc.), you will find that data sets correspond to tables in a database program. The rows are called observations (database "records"), and the columns are called variables (database "fields"). In the example data set below, x, y, z, and w are variables, while 1, 2, 3, and 4 are observation numbers. It is also becoming more common to use the database terms with SAS, so it is good to be aware of both terminologies. Rows=Observations=Records. Columns=Variables=Fields.
However, data sets contain more than just variables and observations. Such things as how values are to be printed, labels to use instead of variable names, and sorting and indexing information can also be included.
As you learn about SAS, you will see that SAS gives you a great deal of control over almost any aspect of your work that you might think of. However, greater control is obtained by using more complex program statements. So, SAS has many "default" settings that save you from extra work and headaches as long as you are satisfied with what is specified by the default. Therefore, many standard actions can be accomplished by very short SAS commands. When there is need for more control, additional commands are available. The first of these "defaults" that we will encounter concerns where data sets are stored.
SAS stores data sets in libraries. (They are really just computer directories or folders, but the term comes from the mainframe language.) There is a default library called “work.” This is where the data will be placed if you don't specify another location. However, work is a temporary library; that is, when SAS is closed, all data sets in work are deleted. If you want a data set to be saved permanently, such as in your “My Documents” folder or anywhere else you wish, you can tell SAS to designate a permanent library using a libname statement. The libname statement will look something like this:
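A minimal sketch of such a statement, using the libref "myownlib" and an assumed folder path (substitute a folder that already exists on your own computer):

```sas
/* Assumed folder path for illustration; the folder must already exist */
libname myownlib 'C:\folder1\myfolder';
```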
There are three parts to this command. The first part is the keyword "libname" which tells SAS you want to define a new library. The second part, "myownlib" is the name that SAS will use to refer to this library. This is called the libref. You can supply any name you like here, as long as it meets the guidelines for allowed names, which are: you can use letters, numbers, and underscores, but can't start with a number. A libref can only be eight characters long, though SAS variable names can be up to 32 characters. The third part of the statement is the pathname (DOS style) showing the actual location of the folder or directory you want to use. It is enclosed in quotes (single or double). On some Windows systems, 'Desktop' and 'My Documents' can be used and will automatically be assigned to the correct path. 'A:' usually works to assign the floppy drive, but only if there is a disk in the drive (applies to any removable volume). Note: The libname statement does not create a folder. It essentially creates a shortcut or alias to an existing folder. We call the library "permanent" because the data is not deleted when SAS closes, but the libref itself will normally have to be recreated in future SAS sessions.
The illustration below shows the SAS Explorer pane (from the left side of the SAS window, note the tab at the bottom) displaying the currently defined libraries. Some of these libraries are standard in SAS. The one called "Rao" has been created with a libname statement. Note that "Work" is explicitly listed.
If you double-click the "Work" file drawer, the data sets in "Work" become visible. The spreadsheet icon in the example below represents a data set named "One." If you double-click on "One" it will open in a "Viewtable," a spreadsheet-like view.
The viewtable has two modes, "Edit" and "Browse." In browse mode, you cannot change the data. To switch modes, go to the "Edit" menu and select the mode you want. A new data set can be created by selecting "Table Editor" under the "Tools" menu.
(Note: In order to go back to previous windows in the explorer pane, you press the folder with the up arrow on it. But this will disappear if the "Explorer" is not the active window. Click on the "Explorer" window to bring it back.)
(Note: There is also another "Explorer" under the "View" menu. This opens a window much like a Windows Explorer.)
When you create a SAS data set, you give it a name, such as "One" in the example above. Names can contain letters, numbers, and underscores, but cannot start with a number. They can be up to 32 characters long. You can refer to a data set in the work directory by its name alone, because that is the default location. But data sets are actually identified by “two-level names,” where the library is given first, followed by a dot, then the dataset name. In other words, the form is libref.datasetname. Since work is the default library, datasetname alone is equivalent to work.datasetname. In order to store a dataset permanently, specify a two-level name whose first level is a defined libref other than work.
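The difference between one-level and two-level names might be sketched as follows. The libref "myownlib", the folder path, and the variable x are assumptions made only for illustration:

```sas
/* Hypothetical libref and folder; the folder must already exist */
libname myownlib 'C:\folder1\myfolder';

/* One-level name: stored in the temporary work library */
data one;
   x = 1;
run;

/* Two-level name with work: refers to the same data set as above */
data work.one;
   x = 1;
run;

/* Two-level name with another libref: stored permanently in that folder */
data myownlib.one;
   x = 1;
run;
```

After this runs, "One" should appear in both the "Work" and "Myownlib" file drawers in the Explorer pane, but only the second copy remains once SAS is closed.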
In rare cases, you may want to change the default library to something other than work. To do this, use the special libref "user" (libname user 'C:\folder1\myfolder';). This allows you to use a one-level name with a permanent library that you specify. It does not change which libraries are permanent and temporary. To create temporary data sets when the default library has been changed, use a two-level name with "work" in the first position.
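A short sketch of how the "user" libref changes the default. The data set names "scores" and "scratch" are hypothetical, and the folder path is assumed to exist:

```sas
/* Make an existing folder the default library */
libname user 'C:\folder1\myfolder';

/* One-level name: now stored permanently in the user library */
data scores;
   x = 1;
run;

/* Two-level name with work: still temporary */
data work.scratch;
   x = 2;
run;
```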
There is also a way to export data in some common file formats. See "File-->Export Data."
In Enterprise Guide, the datasets appear in the project tree under the code that created them. They will be saved when you save your project. (But when you run code you may get a choice of whether or not to replace the existing data. This may affect what is saved in the project.)
Getting Data Into SAS
Most of the time, when you begin working on a project in SAS, your data will not be in a SAS data set. In order for SAS to perform analysis tasks, the data will need to be brought into a SAS data set. Because data may come in so many different forms, SAS is very flexible and provides a variety of ways to do this. We will keep our first examples simple, but be assured that SAS can handle very complex data reading tasks!
The simplest way to get data into a SAS data set is using “instream data.” This means the data is included “in the stream” of the programming statements that will load it. Here is an example. An explanation of the program follows.
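A sketch of such a program, laid out to match the line-by-line explanation. The folder path and the three data rows are assumed values chosen for illustration, and the closing run statement is added here so that the proc step executes when the program is submitted on its own:

```sas
* Sketch of a first program using instream data;
* The folder path and the three data rows are assumed values;
libname mysaslib 'C:\folder1\myfolder';
data mysaslib.myfirst;
   input name $ age;
cards;
Ann 34
Bob 27
Carla 41
;
proc print;
run;
```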
The first two lines are comments. A comment is simply text inserted into computer code that the programming language will ignore. Comments are often used to explain what is happening in the program (for others or for future reference). Sometimes they are used to temporarily disable some statements in the program. Sometimes they are used to "beautify" or enhance the readability of the program. In SAS, there are two ways to write comments. A statement that begins with an asterisk and ends with a semicolon is a comment. This type of comment will only work for one statement at a time. If a larger section of a program is to be commented, that is, multiple statements in one group, you can use "/*" to begin the comment and "*/" to close it. Even semicolons are ignored by this syntax.
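The two comment styles can be sketched like this. The statements inside the block comment are only there to show that they are disabled:

```sas
* This one-statement comment ends at the semicolon;

/* This block comment can span several statements,
   semicolons included:
proc print;
run;
*/
```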
The third line contains a libname definition. The library will be called “mysaslib” and the actual folder location on your computer is given in the quote marks.
The fourth line begins with the keyword "data," which tells SAS this is the beginning of a data step. We can see by the two-level name that the dataset will be called “myfirst” and will be stored in the “mysaslib” library.
The fifth line, the input statement, tells SAS what variables to put into the data set. Each variable name may need to be followed by a code that tells SAS what kind of variable it is and how to read it. These codes are called informats. However, there is a default for this, called "standard numeric," which is just an ordinary number in decimal form, with no commas. Since "age" fits this description, we do not have to include an informat for it. On the other hand, "name" is a character variable. The "$" tells SAS to read "name" as a character variable, which by default is eight characters long. Character variables have no numeric value, and can contain letters, numbers, and most other symbols. They are also known as "strings."
The “cards” statement tells SAS that the list of data to read is coming next ("datalines" and "lines" may be used as synonyms for "cards"). The data are organized in a straightforward way. Each observation is on one line and the variable values are separated by spaces. This is called “list input.” A semicolon on a line by itself indicates the end of the data. (The data are not program statements, so there are no semicolons in the data list.)
In addition to comments, we can use indenting to make our programs more readable. For example, the statements under a data or proc statement can be indented to make the steps look like an outline. Data given in cards should be placed along the margin, though.
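One possible indentation layout, shown here with "datalines" in place of "cards." The data set name and data rows are assumed values:

```sas
data myfirst;
   input name $ age;
   datalines;
Ann 34
Bob 27
Carla 41
;
proc print data=myfirst;
run;
```

The statements inside each step are indented, while the data rows and the closing semicolon stay at the left margin.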
Next, SAS encounters a proc statement, and will therefore compile and execute the data step before going on. The dataset “myfirst” is now created and populated with three observations.
SAS continues by compiling the proc step, which consists of only the “proc print” statement. Without any other commands specified, this will cause the default action of printing the most recently created data set, which, of course, is “myfirst.” Print does NOT mean "print to the printer." Proc print produces a formatted "printout" of the data set in the output window. You can save or print (really) this output using File Menu commands. The result looks like this:
In case you do not want to use the default (last created) data set, or just want your program to be more obvious to the reader, you can specify the data set that proc print will use this way:
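A minimal sketch, assuming the libref "mysaslib" has already been defined in the current session and contains the data set "myfirst":

```sas
/* Name the data set explicitly instead of relying on the default */
proc print data=mysaslib.myfirst;
run;
```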
SAS provides several ways to modify the appearance of the output it produces. Notice that in our example the heading "The SAS System" together with the time, date, and page number appear at the top of the page. "The SAS System" is the default page title. You can supply your own titles by using title statements, as shown here:
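A sketch of title statements in use. The title text is invented for illustration, and the data set is assumed to exist:

```sas
/* Replace the default title with two lines of our own */
title1 'My First Data Set';
title2 'Names and ages';

proc print data=mysaslib.myfirst;
run;
```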
You can have multiple lines of titles. Just add more title statements with higher numbers. Title statements are global; they don't belong to a particular proc and are in effect until changed or deleted. Redefining a title deletes all previously defined titles of that number or higher. To delete a title without replacing it, just include a blank title statement, like "title3;" . This will delete the old title3, as well as title4 or any other higher-numbered titles.
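The redefinition and deletion rules can be sketched as follows, with placeholder title text:

```sas
title1 'Main title';
title2 'Second line';
title3 'Third line';
title4 'Fourth line';

/* Redefining title2 also deletes title3 and title4 */
title2 'Replacement second line';

/* A blank title statement deletes title2 and any higher-numbered titles */
title2;
```

At the end of this sequence only title1 is still in effect.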
Producing HTML (web page) output
Both PC SAS and Enterprise Guide can produce HTML or text output. Enterprise Guide displays HTML by default:
Pretty, isn't it? To get HTML in PC SAS, you can go to "Tools-->Options-->Preferences," click the "Results" tab, then check "Create HTML."
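HTML output can also be requested from program code with ODS (Output Delivery System) statements, an alternative to the menu setting that this lesson does not otherwise cover. A minimal sketch, assuming the output file path shown and an existing data set:

```sas
/* Assumed output file location; the folder must already exist */
ods html file='C:\folder1\myfolder\myfirst.html';

proc print data=mysaslib.myfirst;
run;

/* Close the HTML destination so the file is completely written */
ods html close;
```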
Submit one Word document with all answers either typed in or copied in.
1. Indicate which of the following are valid names for variables and librefs (two questions).
2. If a database has a table with 500 records and 14 fields, how many observations and variables would a SAS dataset containing the same information have?
3. If a spreadsheet had 5 columns and 4 rows, how many observations and variables would a SAS dataset containing the same information have?
4. Looking at the screen print below, answer the following questions:
5. Copy the data below into the SAS editor (use copy and paste) and write a data step to read it, followed by a proc step to print it (to the output window). Make sure the printout matches the original data. The variables are Name, Age, and Grade. Age and Grade are to be read as numeric variables. Allow the data to be saved to the work directory. (Submit Program, Log, and Output.)
Marissa 13 7
Andy 7 1
Martha 9 3
John 10 4
Larry 11 6
6. Copy the following data into the SAS editor and write a data step to read it into a data set. Print the results in html format. The variables are Field, Fertilizer, and Variety. Submit the Program, Log, and the html output instead of the normal output.
1 A Magnus
2 B Arbin
3 A Carver
4 B Visser
5 A Turnip
6 B Danun
7. Create a libref for a folder on your hard drive and another for a floppy disk. (If you don't have a floppy disk, you may use a pen drive). Modify the program in problem 5 so that the data set is saved in each location. Use the explorer window and the viewtable to verify that it is actually there. (Submit your program and log for this problem.)
8. Submit only the program statements (editor) for this problem. Note that to "reference a data set" means to tell a proc, like proc print, which data set to use.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org . "SAS" is a registered trade name of SAS Institute, Cary, North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.