Statistical Support - Overview
Please note that due to staffing cuts from the current budget situation at Harvard, we no longer offer one-on-one statistical consulting services, including helping users with data analysis.
We will update this page if this policy changes in the future.
For data assistance, please contact us.
MIT users should contact us using this email form.
Workshops & Tutorials - Overview
We offer workshops on statistical software and data analysis for the Harvard and MIT communities, as described on the IQSS web site Training page. We have also collected various internal and external documentation and links on data and statistics.
If you have material or links to add to this section, please contact us.
Workshop Schedule and Handouts
We offer single session workshops multiple times throughout the year. Our goal is to help users understand statistical concepts and analyze data using statistical software. For complete descriptions of our offerings, see the IQSS web site Training page. To enroll in any of these classes, please contact Diane Sredl at dataclass@help.hmdc.harvard.edu.
Unless otherwise noted, all classes are held in CGIS Knafel Building Room K018.
Introduction to Numeric Data Resources/Introduction to Stata:
Please note: The first hour covers Data Resources, and the last 3 hours cover Introduction to Stata.
Wed, Sept. 16th - 9 am to 1 pm
Fri, Oct 2nd - 9 am to 1 pm
Tues, Oct. 20th - 2 pm to 6 pm
Fri, Nov. 13th - 9 am to 1 pm
Regression in Stata:
Fri, Oct. 9th - 9 am to Noon
Tues, Oct. 27th - 2 pm to 5 pm
Fri, Nov. 20th - 9 am to Noon
Data Management in Stata:
Tues, Sept. 22nd - 3 pm to 6 pm
Fri, Oct. 30th - 9 am to Noon
Wed, Nov. 18th - 2 pm to 5 pm
Introduction to R:
Wed, Sept. 23rd - 9 am to Noon
Fri, Nov. 6th - 9 am to Noon
Tues, Dec. 1st - 2 pm to 5 pm
Introduction to SAS:
Wed, Oct. 14th - 2 pm to 5 pm
Fri, Dec. 4th - 9 am to Noon
Attached are PDF versions of the class materials for our workshops.
General Statistical Tutorials
Use the following links and attached files to gain general knowledge on statistical functions and tools:
- Zelig: Everyone's Statistical Software - introduction to the R Project for Statistical Computing, and the easy front end offered by Zelig.
- Reading fixed-field data in Stata - Describes how to read fixed-format data into Stata
- UCLA Resources for learning Stata - From the University of California, Los Angeles
- Raynald Levesque's SPSS Archive - Tutorials, FAQs and examples for SPSS
- SPSS Syntax Guide (See pdf below) - Explains how to create a dataset using SPSS files, and how to correct common problems in SPSS syntax files from other sources, such as ICPSR
If you have material or links to add to this section, please contact us.
Qualitative Data Analysis Resources
These topics provide resource information about qualitative data analysis in general.
When are qualitative data analysis techniques typically used?
Some techniques that might require qualitative data analysis include: ethnography, unstructured or semi-structured interviews, focus groups, document review, and participant observation.
Generally, researchers use qualitative techniques when they are interested in advancing an established theory, generating a new theory, or conducting research that requires a detailed and intimate understanding of the group of interest.
Additionally, when findings are highly quantitative in nature, researchers could use qualitative findings to illustrate their conclusions.
What are some methods that qualitative data analysts use?
No software can give you a definitive result from your data. However, software is available to help you identify emergent themes in your research.
Researchers often use text analysis software to identify themes, create codes, and link multiple documents.
Further kinds of analysis can include analysis of audio and video material, web sites, emails, web logs (blogs), and other virtual written material.
What are some resources available to analyze qualitative data?
- Our Public Labs (Overview) have software available to help analyze qualitative data.
- The Quarc web site has a good comparison of the four main types of software
- There also is good open-source software available for text-based analysis -
- Text Manipulation, Management, Mining, and Analysis Tools:
- AnSWR
- From the CDC, for mixed qualitative/quantitative analysis.
- EZ-Text
- From the CDC, for textual data analysis.
- Judge
- Performs automatic classification and clustering of documents.
- Kea
- Performs automated key phrase extraction.
- Language Archiving
- A hosted service for text management and analysis
- Perl
- The programming language for supreme text mangling.
- SIL tools
- If you have a lot of text on-line, the concordance, indexing, and database from the Summer Institute of Linguistics may be what you need.
- Tabari
- Uses special purpose rules for categorizing news events from news text.
- Tams
- Textual analysis and markup.
- TextStat
- Another indexing/concordance package.
- Weft
- For qualitative data management and coding.
- Weka
- A collection of machine learning algorithms for data/text mining.
- YALE
- A flexible standalone package that contains many data mining algorithms.
- Frequently cited books about qualitative data include:
- Denzin, N. and Lincoln, Y. (1994) Handbook of Qualitative Research, Thousand Oaks, CA: Sage.
- Marshall, C. and Rossman, G. (1999). Designing Qualitative Research. Newbury Park, CA: Sage.
Introduction to Data Tutorial
Summary-level data are published data points in either print or electronic format. You would use summary-level data if you were looking for a quick statistic such as the unemployment rate for the current month or if you wanted to see a table of statistics, such as GNP for various countries during a specific time period. Consult the Search Tools for more information on locating summary-level data.
Micro-level data files are the numerically coded results of individual responses to such files as the census questionnaires, public opinion surveys, and more. You have much more flexibility to work with the data and run statistical analyses on the extracted data. The data are in an unanalyzed, raw format of columns and rows, usually in ASCII format but not always. Some raw data files are accompanied by files in SPSS, SAS or other statistical software format for easier use in these packages. If you are working with only the raw data, you must consult the data documentation (codebook) and write a small program or use an extraction program to have the computer read in the data into a useable format. The Harvard-MIT Data Center provides access to a vast collection of social science micro-level data through the IQSS Dataverse Network and the Harvard and MIT libraries have extensive collections of data sets from many different sources. Consult the Search Tools for more information on locating micro-level data.
Codebooks provide information on the structure, content, and layout of a data file and the questionnaire, if any, used for the survey or study. Many codebooks are available electronically with the data file. Littauer Library (Harvard) and Dewey Library (MIT) have extensive collections of printed codebooks for data files from ICPSR, Roper and other sources and these are searchable in the HOLLIS and BARTON online library catalogs, respectively.
Example ASCII Data Files
ASCII data files can be in a fixed or free (delimited) format. For DBMSCopy, or any other program to process the data, you must tell the program how the data file is structured: the name of each variable, the type of the variable, and possibly the position and length of the variable.
Following is an example of a survey with five observations (respondents) and three variables (question responses): AGE, INC(OME), and PARTY. With free format, as in Methods 1 and 2, some special character(s), such as commas, blank spaces, or tabs, separate or delimit the value for AGE from that for INCOME, and so on. As long as the age value always precedes the income value, you need not line them up row by row. We can see that the first respondent in this survey is 24 years of age and has an income of $12,345. The codebook (data documentation) would explain how the coding works for the PARTY variable. In this example, 0=democrat and 1=republican. The main drawback to this method of organization is that the extra characters greatly add to the size of an ASCII file for large datasets. Therefore, if you receive your dataset from some outside source, the information might be packed into fixed format.
Method 1:
AGE,INC,PARTY
24,12345,0
45,24118,1
34,45678,0
55,112444,1
29,9999,0

Method 2:
24 12345 0
45 24118 1
34 45678 0
55 112444 1
29 9999 0
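As a rough illustration of reading the two free-format layouts above, the following Python sketch parses Method 1 with the standard csv module and Method 2 by splitting on blank spaces. The sample strings are copied from the example; the names method1, method2, rows1, and rows2 are illustrative only and do not come from any data center tool.

```python
import csv
import io

# Sample data copied from Methods 1 and 2 above.
method1 = "AGE,INC,PARTY\n24,12345,0\n45,24118,1\n34,45678,0\n55,112444,1\n29,9999,0\n"
method2 = "24 12345 0\n45 24118 1\n34 45678 0\n55 112444 1\n29 9999 0\n"

# Method 1: comma-delimited, with the variable names in the first line.
rows1 = [
    {name: int(value) for name, value in record.items()}
    for record in csv.DictReader(io.StringIO(method1))
]

# Method 2: blank-delimited with no header, so the names must come
# from the codebook.
names = ["AGE", "INC", "PARTY"]
rows2 = [
    dict(zip(names, (int(value) for value in line.split())))
    for line in method2.splitlines()
]

print(rows1[0])        # {'AGE': 24, 'INC': 12345, 'PARTY': 0}
print(rows1 == rows2)  # True
```

Both layouts yield the same five observations, which is the sense in which the values need not line up row by row.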
In fixed format, as in Methods 3 and 4, the values for a particular variable have to appear in the same place for each observation. So if your INCOME value appears as characters 3-8 in a row of data for the first observation, it must do so in later rows as well. To read the data into a program such as STATA or DBMSCopy, you have to identify where the values for each variable are located in the ASCII file. These versions of fixed form data cannot be treated as free form; no simple delimiter separates values on one variable from those of the next along a row (also called a record in many codebooks). This is very much how HMDC data often appear. It is useless without a data dictionary telling you and the computer what the numbers represent.
Method 3:
240123450
450241181
340456780
551124441
290099990

Method 4:
24012345demo
45024118repb
34045678demo
55112444repb
29009999demo
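To make the role of the data dictionary concrete, here is a minimal Python sketch that reads the Method 3 records by column position. The layout (AGE in columns 1-2, INC in columns 3-8, PARTY in column 9) is inferred from the example and the text above; LAYOUT and parse_fixed are hypothetical names chosen for illustration.

```python
# Column layout for Method 3, as a codebook might give it
# (1-based, inclusive start and end columns).
LAYOUT = {"AGE": (1, 2), "INC": (3, 8), "PARTY": (9, 9)}

method3 = ["240123450", "450241181", "340456780", "551124441", "290099990"]


def parse_fixed(line, layout):
    """Slice one fixed-format record into integer fields."""
    return {
        name: int(line[start - 1:end])
        for name, (start, end) in layout.items()
    }


records = [parse_fixed(line, LAYOUT) for line in method3]
print(records[0])  # {'AGE': 24, 'INC': 12345, 'PARTY': 0}
```

Without the layout table, nothing in the file itself says where one variable stops and the next begins.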
Methods 1-4 show data that were formatted so that one line was equal to one record. Some older datasets are organized into cards, and one record may be stored on multiple lines. Method 5 shows an example of five records, each of which takes two lines. In most card data, lines will be exactly 80 characters (and padded with zeros if necessary to make the lines this long).
Method 5:
24012345 | Town Registrar
00000000
45024118 | County Clerk
10000000
34045678 | State Rep. (w/o honoraria)
00000000
55112444 | State Rep. (w/ honoraria)
10000000
29009999 | Justice of the Peace
00000000
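A small Python sketch of how card data like Method 5 might be regrouped into records follows. It assumes every record occupies exactly two lines, and it assumes AGE sits in columns 1-2 of the first card (as in Method 4); the layout of the second card is not described here, so it is kept as raw text. The names group_cards and LINES_PER_RECORD are illustrative.

```python
LINES_PER_RECORD = 2  # assumption: each record spans exactly two cards

cards = [
    "24012345 | Town Registrar", "00000000",
    "45024118 | County Clerk", "10000000",
    "34045678 | State Rep. (w/o honoraria)", "00000000",
    "55112444 | State Rep. (w/ honoraria)", "10000000",
    "29009999 | Justice of the Peace", "00000000",
]


def group_cards(lines, per_record):
    """Return a list of records, each a list of its cards in order."""
    if len(lines) % per_record:
        raise ValueError("file ends in the middle of a record")
    return [lines[i:i + per_record] for i in range(0, len(lines), per_record)]


records = group_cards(cards, LINES_PER_RECORD)

# A variable is then addressed by card number and column span.
first_age = int(records[0][0][0:2])  # card 1, columns 1-2
print(len(records), first_age)       # 5 24
```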
Method 6 shows hierarchical data. Each of the person records implicitly belongs to the country that precedes it. But different types of records have different variables and layouts. SAS can handle layouts of this complexity, as can SPSS with programming, but not by using the standard DATA LIST. Stata and DBMSCopy cannot. Another approach is to use a text manipulation language like Perl to split the files (appending identifiers for each record so that you can continue to match people to countries).
Method 6:
COUNTRYUSA1967
PERSON12233123132313231323321312323132
PERSON12233123132313231323321312323132
PERSON12233123132313231323321312323132
COUNTRYUK1968
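The splitting approach can be sketched in Python as well. This is only an illustration under the assumption that each line begins with the literal record label COUNTRY or PERSON, as in Method 6; real files usually mark the record type in a fixed column given by the codebook. The function name split_hierarchy is hypothetical.

```python
lines = [
    "COUNTRYUSA1967",
    "PERSON12233123132313231323321312323132",
    "PERSON12233123132313231323321312323132",
    "PERSON12233123132313231323321312323132",
    "COUNTRYUK1968",
]


def split_hierarchy(lines):
    """Separate country and person records, tagging each with a country id."""
    countries, persons = [], []
    country_id = 0
    for line in lines:
        if line.startswith("COUNTRY"):
            country_id += 1
            countries.append((country_id, line))
        elif line.startswith("PERSON"):
            if country_id == 0:
                raise ValueError("person record appears before any country")
            # The id lets each person be matched back to its country later.
            persons.append((country_id, line))
    return countries, persons


countries, persons = split_hierarchy(lines)
print(len(countries), len(persons))  # 2 3
```

The two lists can then be written to separate rectangular files and merged on the country id in any statistical package.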
Older Data Formats Tutorial
This tutorial is intended to help you to take a dataset in some known but inconvenient format, and convert it to a more usable format, extracting only the variables and cases that interest you. Although this tutorial is intended to be complete enough for you to extract most data without reference to the software manuals, we suggest that you consult the manuals and online help for the programs that you use, before requesting assistance.
For most data extraction, we recommend the programs Stat Transfer and DBMSCopy, although a few tasks may require SAS or TextPipe Pro. All four programs are available at the data center, and using these tools together you can convert almost any statistical format you encounter. However, some data is in obscure, obsolete, or proprietary formats, and we cannot guarantee that you will be able to convert it easily.
Both Stat Transfer and DBMSCopy are programs designed specifically to convert data among a wide variety of statistical file formats. Both are installed in the Concourse Computer Lab (see Locations) and can handle almost all of the data files that we supply. We installed these tools under Windows, and they are available under the Start menu, under All Programs.
Preparing Data
- Download the data.
You can do this with your web browser or FTP client. Transfer files as binary and not as text. Transferring data files as text can cause data corruption.
- Uncompress your data.
Some data files are compressed for storage. Filename extensions such as .Z, .gz, and .zip often indicate that the file is compressed. WinZip, available in the labs, can uncompress most files.
- Identify the format of the data.
The most reliable way to identify the format is by reading the documentation (codebook, README, LOG, and notes) that accompanies the data.
Sometimes the filename extension identifies a dataset. This is not completely reliable, as the same extension can be used for multiple data types, some datasets are not labelled, and different operating systems use different labels for the same data. Some common extensions are as follows:
- .xls - Microsoft Excel Spreadsheet
- .dbf - dBase II, III, or IV format
- .por - SPSS portable format
- .sav - SPSS PC binary (also SST binary)
- .ssp, .xpt, .ssd, .sd2 - SAS file formats
- .dta - Stata
- .html, .xml - Tagged text format
- .csv - Comma separated values (delimited)
- .tsv, .tab - Tab separated values (delimited)
- .txt, .asc - Free-text ASCII data or documentation
- .dat - Usually free-text ASCII, can be a variety of formats
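As a hedged illustration, the extension list above can be turned into a simple lookup that produces a first guess at a file's format. The table only restates the list, and the function name guess_format is hypothetical. The result is a hint to check against the documentation, not a reliable identification.

```python
from pathlib import Path

# First-guess formats, restating the list above.
EXTENSION_HINTS = {
    ".xls": "Microsoft Excel spreadsheet",
    ".dbf": "dBase II, III, or IV",
    ".por": "SPSS portable",
    ".sav": "SPSS binary (also SST binary)",
    ".ssp": "SAS", ".xpt": "SAS", ".ssd": "SAS", ".sd2": "SAS",
    ".dta": "Stata",
    ".html": "tagged text", ".xml": "tagged text",
    ".csv": "comma separated values",
    ".tsv": "tab separated values", ".tab": "tab separated values",
    ".txt": "free-text ASCII data or documentation",
    ".asc": "free-text ASCII data or documentation",
    ".dat": "usually free-text ASCII, can be a variety of formats",
}
COMPRESSED = {".z", ".gz", ".zip"}


def guess_format(filename):
    """Return (format hint or None, whether the file looks compressed)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compressed = bool(suffixes) and suffixes[-1] in COMPRESSED
    if compressed:
        suffixes = suffixes[:-1]  # look at the extension under the wrapper
    hint = EXTENSION_HINTS.get(suffixes[-1]) if suffixes else None
    return hint, compressed


print(guess_format("study.dta.gz"))  # ('Stata', True)
```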
Statistical Formats
If the file is in a commercial statistical format, such as SAS, Stata, SPSS, or Excel, or is in a delimited format, such as CSV or TSV:
- Run Stat Transfer.
Select the Start Menu, select All Programs, and then choose Data Apps.
- Select the file and choose the appropriate input format in the first box.
- Choose the appropriate output format from the next box.
- Click Transfer.
Tagged Text Formats
There are no universal standards for data stored as HTML or tagged text format. We recommend using TextPipe Pro (located under the All Programs menu Data Apps option) or a programming language, such as Python or Perl. Consult the documentation extensively, and expect some trial and error.
ASCII Free-Text Data File Formats
Many thanks to Stephen Voss for his assistance with this section.
There are no universal standards for data stored as ASCII or free-text format. For DBMSCopy, or any other program to process the data, you must tell the program how the data file is structured: the name of each variable, the type of the variable, and possibly the position and length of the variable.
Functional ASCII data files come in two forms: fixed format and free (or delimited) format. In fixed form, the values for a particular variable have to appear in the same place for each observation. If your AGE value appears as characters 4-6 in a row of data for the first observation, it must do so in later rows as well. This is difficult because, to read the data into a program such as STATA or DBMSCopy, you have to identify where the values for each variable are located in the ASCII file. Therefore, if you create your own dataset you almost certainly want to use the easier free format. With free format, some special characters, such as commas, blank spaces, or tabs, delimit the value for AGE from that for INCOME, and so on. As long as the AGE value always precedes the INCOME value you need not line them up row by row. The main drawback to this method of organization is that the extra characters add greatly to the size of an ASCII file for large datasets. Therefore, if you receive your dataset from some outside source, such as the data center, the information might be packed into fixed format.
Example: A survey with five observations (respondents) and three variables (question responses): AGE, INC(OME), and PARTY. In free format the data could appear in a number of ways.
METHOD 1:
AGE,INC,PARTY
24,12345,0
45,24118,1
34,45678,0
55,112444,1
29,9999,0

METHOD 2:
24 12345 0
45 24118 1
34 45678 0
55 112444 1
29 9999 0

METHOD 3:
24  12345 0 | Town Registrar
45  24118 1 | County Clerk
34  45678 0 | State Rep. (w/o honoraria)
55 112444 1 | State Rep. (w/ honoraria)
29   9999 0 | Justice of the Peace
Note that Method 3 also is in fixed format, since the first two characters represent AGE, the next seven (including blank spaces) represent INC(OME), and the last character is a dummy variable representing PARTY. However, another fixed form example might be:
METHOD 4:
240123450
450241181
340456780
551124441
290099990

METHOD 5:
24012345demo
45024118repb
34045678demo
55112444repb
29009999demo
Unlike Method 3, these versions of fixed form data cannot be treated as free form; no simple delimiter separates values on one variable from those of the next along a row (also called a record in many codebooks). This is very much how HMDC data often appears; it is unintelligible without a data dictionary telling you and the computer what the numbers represent.
The previous example showed data that was formatted so that one line was equal to one record. Some older datasets are organized into cards, and one record may be stored on multiple lines.
Method 6:
24012345 | Town Registrar
00000000
45024118 | County Clerk
10000000
34045678 | State Rep. (w/o honoraria)
00000000
55112444 | State Rep. (w/ honoraria)
10000000
29009999 | Justice of the Peace
00000000
Method 6 shows an example of five records, each of which takes two lines. In most card data, lines are exactly 80 characters and are padded with zeros if necessary to make the lines this long.
Method 7:
COUNTRYUSA1967
PERSON12233123132313231323321312323132
PERSON12233123132313231323321312323132
PERSON12233123132313231323321312323132
COUNTRYUK1968
Method 7 shows hierarchical data. Each of the person records implicitly belongs to the country that precedes it. But different types of records have different variables and layouts.
SAS can handle layouts of this complexity, as can SPSS with programming, but not by using the standard DATA LIST. Stat Transfer and DBMSCopy cannot.
Another approach is to use a text manipulation language like Perl or Python to split the files, appending identifiers for each record so that you can continue to match people to countries:
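#!/usr/bin/perl
use strict;
use warnings;
# Offset of the record-type label within each line; set this from the codebook.
my $RECIDSTART = 0;
my $country_count = 0;
my $country_field = '';
open(my $countryf, '>', 'country.out') or die "Cannot open country.out: $!";
open(my $statef, '>', 'state.out') or die "Cannot open state.out: $!";
while (my $line = <>) {
    chomp $line;
    if (substr($line, $RECIDSTART, 7) eq 'COUNTRY') {
        $country_count++;
        # Right-justify the running country number in a 10-character field.
        $country_field = sprintf('%10d', $country_count);
        print $countryf $line . $country_field . "\n";
    }
    elsif (substr($line, $RECIDSTART, 5) eq 'STATE') {
        # Tag each lower-level record with the number of its country.
        # Substitute the label used in your file (PERSON in Method 7).
        print $statef $line . $country_field . "\n";
    }
}
close($countryf);
close($statef);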
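Another approach is to flatten the files, combining all the levels of the hierarchy into one long record:
#!/usr/bin/perl
use strict;
use warnings;
# Offset of the record-type label within each line; set this from the codebook.
my $RECIDSTART = 0;
my $country_rec = '';
while (my $line = <>) {
    if (substr($line, $RECIDSTART, 7) eq 'COUNTRY') {
        chomp $line;
        $country_rec = $line;    # remember the current country record
    }
    elsif (substr($line, $RECIDSTART, 5) eq 'STATE') {
        # Write the country record followed by the lower-level record.
        print $country_rec . $line;
    }
}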
This latter approach can waste a lot of space. Also, you have to remember that you did this, and take pains not to do things like sum country variables across persons, or you get an overcount.
Preparing an ASCII Dataset
Sometimes, an ASCII table of data contains extraneous whitespace (tabs or spaces) and blank lines. Use TextPipe Pro, available in the lab (on the All Programs menu, select the Data Apps option) or a programming language, like Perl or Python (on the All Programs menu, select the Programming Apps option), to clean these.
The characters that signal the end of a line and the end of a file differ between DOS and UNIX or MacOS. You might need to use a program to convert these characters. Microsoft Word, Stuffit Expander, and many unzip and compression utilities make this conversion.
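If none of those utilities is at hand, the conversion is short to write. The Python sketch below works on raw bytes so that it does not depend on the file's character encoding; it assumes the only differences are CR LF or bare CR line endings and an optional trailing Ctrl-Z (0x1A) end-of-file marker, which some DOS programs append. The function name to_unix_newlines is illustrative.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert DOS (CR LF) and bare CR line endings to UNIX (LF)."""
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Drop a trailing Ctrl-Z end-of-file marker if one is present.
    return data.rstrip(b"\x1a")


print(to_unix_newlines(b"24,12345,0\r\n45,24118,1\r\n\x1a"))
# b'24,12345,0\n45,24118,1\n'
```

To convert a file, read it with open(path, "rb"), pass the bytes through the function, and write the result with open(path, "wb").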
Some ASCII data files contain the names of each variable in the first line of the dataset. You can tell DBMSCopy to use the first line to define the dataset.
Some data files are in delimited format. This means that each record is written on a separate line, and each line contains a fixed number of fields that are separated by a delimiter character, which typically is a comma or tab. You can tell DBMSCopy to interpret the file as a delimited file. If the dataset also contains the names of the variables as the first line, DBMSCopy might be able to convert the file without a dictionary.
Note: DBMSCopy assumes that the delimiter is a comma (","). To change the delimiter to a space, type the space character in the delimiter field entry box on the ASCII Input options panel. To change the delimiter to a tab, type 9 in this entry box.
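The same ideas (a default comma delimiter, an optional header line, and the tab entered as character code 9) can be sketched outside DBMSCopy with Python's csv module. The function read_delimited and the V1, V2, ... fallback names are hypothetical, chosen only for this illustration.

```python
import csv
import io

TAB = chr(9)  # the tab character is ASCII code 9


def read_delimited(text, delimiter=",", header=True):
    """Return (variable names, data rows) from delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if header:
        return rows[0], rows[1:]
    # No header line: fall back to placeholder names.
    return [f"V{i + 1}" for i in range(len(rows[0]))], rows


names, rows = read_delimited("AGE\tINC\tPARTY\n24\t12345\t0\n", delimiter=TAB)
print(names)  # ['AGE', 'INC', 'PARTY']
print(rows)   # [['24', '12345', '0']]
```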
Other ASCII files are in fixed-field format. This means that each variable starts and ends at a fixed position in the dataset. In this case, you must use a dictionary to tell DBMSCopy how to define each variable. When you open an ASCII dataset in DBMSCopy, it looks for a file with the .dct extension to use as a dictionary. If this dictionary does not exist the program asks you to create the dictionary interactively from within DBMSCopy (see the DBMSCopy help for details).
Note: For card-image data, you must tell DBMSCopy how many lines (cards) make up each record, and specify the card as well as the column for each variable. This is often the case for Roper data.
Preparing Binary ASCII Files
The Windows version of DBMSCopy is the easiest to use. Go to the Start Menu, select All Programs, and then choose Data Apps.
- Launch the DBMSCopy program.
- Select the Interactives menu Copy Database option.
- Use the dialog box to specify the location, name, and type of the file.
Be careful to specify the correct file type, because the file name and extension might be ambiguous.
- If the data was created by a spreadsheet, the variable names and data might be located anywhere in that spreadsheet. DBMSCopy asks you to specify the data area.
- Choose to view a sample of the file, as DBMSCopy interprets it, to modify or filter the transformation, or to transform the data automatically.
Select one (for this example, use the last option).
- Use the dialog box to specify the location, name, and type of the output data.
- Load the new data in your favorite statistical program and check it to make sure that there are no obvious errors, missing variables, or missing records.
Extracting Fields and Records
After selecting the Interactives menu Copy Database option, DBMSCopy will display a dialog box labeled Power Panel. You can use this control menu to select fields and records.
- To extract fields - Select the Power Panel menu Variable Information option.
Then, use the Keep check boxes to keep or drop variables from the list. Note: If you convert data from an ASCII file, you can drop variables simply by leaving them undefined.
- To extract records (subsetting) - Most analysis programs enable you to extract particular fields and records:
- Using DBMSCopy - Select the Power Panel menu Record Filter and Equations option. Then, build a filtering expression to select the records that you want to keep. The built-in help explains these expressions in detail. Here is a simple example of an expression that extracts only records for Alabama in 1990:
select upper(STATENAME) = "ALABAMA" and YEAR = 1990;
- Using UNIX systems - Use the cut command to extract particular fields from a text file. For example, the following command extracts the first and third fields of a comma (",") delimited ASCII dataset into the extract file:
unzip -p datafile.zip | cut -f1,3 -d"," > extract
Check the manual pages on your version of UNIX before attempting this.
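For readers who prefer a script, the same kind of extraction (keep some fields, keep only records for Alabama in 1990) might look like the Python sketch below. The sample rows and the VALUE column are hypothetical placeholders, and the function name extract is illustrative; only STATENAME and YEAR come from the filter expression above.

```python
import csv
import io

# Hypothetical sample rows; the VALUE column is a placeholder.
sample = (
    "STATENAME,YEAR,VALUE\n"
    "Alabama,1990,1\n"
    "Alabama,2000,2\n"
    "Alaska,1990,3\n"
)


def extract(text, keep_fields, keep_record):
    """Keep the named fields of every record for which keep_record is true."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: row[name] for name in keep_fields}
        for row in reader
        if keep_record(row)
    ]


subset = extract(
    sample,
    keep_fields=["STATENAME", "VALUE"],
    keep_record=lambda r: r["STATENAME"].upper() == "ALABAMA"
    and int(r["YEAR"]) == 1990,
)
print(subset)  # [{'STATENAME': 'Alabama', 'VALUE': '1'}]
```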
Troubleshooting Check List
- File extensions - Filename extensions can be ambiguous. You must properly identify the real format of the file for translation to work.
- Variable types - Some data formats do not support all variable types, so information might be lost in the conversion.
- Variable value formats - Some data types do not support all variable formats, so information might be lost in the conversion.
- Missing values - Some data formats do not have any way of specifying missing values. DBMSCopy typically translates missing values as blanks if the target data format does not have a code for missing values.
- Variable names - Some data formats disallow particular characters in variable names, or limit the length of names. DBMSCopy truncates variable names that are too long for the target format, and usually replaces illegal characters with the underscore character.
- Size limitations - Some data formats limit the number of variables or records in a dataset. DBMSCopy truncates these. DBMSCopy also can run out of memory trying to process large datasets. If this happens, close all other applications, or expand virtual memory, and run DBMSCopy again.
- Numeric or character variables - A spreadsheet can have numeric values and character values in one column, whereas a variable in statistical software is either numeric or character type. When converting a spreadsheet, DBMSCopy has to assign a variable type to each data column.
- If there is just one character in a column, like a blank in a cell you wanted to mark as empty, DBMSCopy assigns the variable resulting from this column the type character. To avoid this problem, make sure that you only type numbers in a column of a numeric variable. Do not type anything in cells with missing values, just skip them.
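The last item is easy to reproduce. The Python sketch below mimics the behaviour described (it is not DBMSCopy's actual rule, which is not documented here): a column counts as numeric only if every non-empty cell parses as a number, so a single typed blank turns the whole column into a character variable. The function name column_type is hypothetical.

```python
def column_type(cells):
    """Return 'numeric' if every non-empty cell is a number, else 'character'."""
    for cell in cells:
        if cell == "":
            continue  # a truly empty cell is treated as a missing value
        try:
            float(cell)
        except ValueError:
            # One stray character, even a single blank, decides the type.
            return "character"
    return "numeric"


print(column_type(["24", "", "34"]))   # numeric
print(column_type(["24", " ", "34"]))  # character
```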
Older Data Formats
This section describes other possible data formats that you might encounter.
EBCDIC DATA
Some older data sets may use the EBCDIC character set rather than the ASCII character set. The Linux/MacOS command cat file | dd conv=ascii > output can convert EBCDIC files.
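Where dd is not available, a character-set conversion can also be sketched in Python. This example assumes the file uses the common US EBCDIC code page, which Python exposes as the cp037 codec; other code pages (for example cp500) exist, so check the documentation for the dataset. It also assumes the data are plain character fields: packed-decimal or binary fields must not be run through a character conversion.

```python
def ebcdic_to_text(data: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes into a Python string."""
    return data.decode(codepage)


# The bytes C1 C7 C5 40 F2 F4 are "AGE 24" in EBCDIC code page 037.
sample = bytes([0xC1, 0xC7, 0xC5, 0x40, 0xF2, 0xF4])
print(ebcdic_to_text(sample))  # AGE 24
```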
Other Mainframe Data formats
DBMSCopy can convert among most standard spreadsheet, database, and statistical formats, but SAS, Perl, and TextPipe Pro are the primary tools for mainframe and obscure data conversions.
Input formats are tools for reading data into SAS, which is located under the Start Menu, under All Programs, then under Statistical Apps. You must write a small SAS program if you use these formats:
- $CBw. informat - Useful for column-binary data, such as older Roper data.
- S370FPD, $VARYING, $EBCDIC - For packed-decimal, variable-length, and EBCDIC data, respectively.
The Proc convert tool can be used with OSIRIS data as described in the EBCDIC Data Conversion section.
The Proc datasource tool is another conversion utility. Here is an example program using it to convert an IMF file (see the SAS help for more information):
libname OUTPUT '/path/to/outdir';
filename dot '/path/to/inputfile';
proc datasource filetype=imfdotsp infile=dot out=OUTPUT;
where country='126' and partner='001';
run;
This tool supports the following formats in SAS 6.12:
- Bureau of Economic Analysis, U.S. Department of Commerce (BEA)
- CITIBASE Data Files
- Haver Analytics Data Files
- International Monetary Fund Data Files (IMF)
- U.S. Bureau of Labor Statistics Data Files (BLS)
- Standard & Poor's Compustat Services Financial Database Files
- Center for Research in Security Prices (CRSP) Data Files
- Organisation for Economic Co-operation and Development (OECD)
- FAME Information Services
You also can use SAS for packed decimal and EBCDIC data. The SAS routines S370FPD, $VARYING, and $EBCDIC convert packed decimal, variable-length, and EBCDIC data respectively.
For OSIRIS format data, use SAS, with this sample code:
libname OUTPUT '/path/to/outdir';
filename DATA1 '/path/to/study.data';
filename DICTIONARY1 '/path/to/study.dict';
proc convert osiris=DATA1 dict=DICTIONARY1 out=OUTPUT.newname;
run;
exit;
Note: Both the dictionary and the data file should be in the original EBCDIC format. In many cases, ICPSR has converted the data file to ASCII and added line feeds. To convert it back to EBCDIC, first strip the line feeds ( tr -d \\n < asciidata > asciistrip ), then convert it back ( dd conv=ebcdic < asciistrip > ebcdicdata ).
The Convert::IBM390 Perl module is available from public sources and can be used to convert packed-decimal, zoned-decimal, EBCDIC, and other related formats. You need Perl installed on your system to use this module.
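To show what a packed-decimal conversion involves, here is a minimal Python sketch. It is not the Convert::IBM390 interface; it only illustrates the standard IBM packed layout, in which each byte holds two decimal digits and the final half-byte holds the sign (hexadecimal D or B for negative, A, C, E, or F for positive). The function name unpack_packed_decimal and the scale parameter (number of implied decimal places, taken from the codebook) are illustrative.

```python
def unpack_packed_decimal(field: bytes, scale: int = 0):
    """Convert an IBM packed-decimal field to a Python number."""
    if not field:
        raise ValueError("empty field")
    digits = []
    for byte in field[:-1]:
        digits += [byte >> 4, byte & 0x0F]  # two digits per byte
    digits.append(field[-1] >> 4)           # last byte: one digit ...
    sign_nibble = field[-1] & 0x0F          # ... and the sign
    if any(d > 9 for d in digits) or sign_nibble < 0x0A:
        raise ValueError("not a valid packed-decimal field")
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    return value / 10 ** scale if scale else value


print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C])))  # 12345
print(unpack_packed_decimal(bytes([0x12, 0x3D])))        # -123
```

Because the digits are stored in binary half-bytes, such fields must be extracted from the raw file before any EBCDIC-to-ASCII character conversion is applied.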
Consulting & Data Reference - Overview
Our consultants work to help users access data.
Please note that due to staffing cuts from the current budget situation at Harvard, we are no longer offering one-on-one statistical consulting services, including helping users analyze data.
We will update this page if this policy changes in the future.
For assistance with accessing data, please contact us.
MIT users should use this email form for assistance with accessing data.
Statistical Consulting
Finding Data
In collaboration with Numeric Data Services (in the Harvard College Library) we provide access to data for research and instruction. We provide integrated access to major subscription data archives, such as the Roper Center for Public Opinion, Inter-University Consortium for Political and Social Research, and Wharton Data Research Services; to major public collections, like the U.S. Census Dataweb and National Center for Health Statistics; and to special collections of interest to affiliates.
Most of the data we manage is available online through the IQSS Dataverse Network. Also see Numeric Data Services Collections for information about Wharton Research Data Services, NCHS, and selected other holdings.
If you do need assistance finding data in the IQSS Dataverse Network, or in locating data from any other sources, reference assistance is available by email, phone, or in person. Please contact us with questions or to request an appointment at dataquest@help.hmdc.harvard.edu.
Downloading Data
Most summary-level data files are available for downloading into such spreadsheet packages as Microsoft Excel or database or statistical software through a standard File-Save command. Some databases might offer downloads into other formats, such as tab-delimited. Spreadsheet packages and other software can read this format easily. For assistance on downloading data from particular web sites or databases, consult the online help or the database documentation. Littauer Library holds an extensive collection of help guides in support of the numeric databases available in the Library.
IQSS Dataverse Network
You can download data from the IQSS Dataverse Network. Many studies are available to the general public, while some are restricted by licensing requirements of the data producer. Your Harvard or MIT authentication information might be required.
Data Formats and Extracts
For most data, the Dataverse Network automatically translates the data to your preferred format, and allows you to select subsets for extraction. The Dataverse Network also provides statistical analysis online.
If you encounter a problem with downloading, reformatting, or extracting data through the Dataverse Network please use this contact form to report it.
Other Data Formats
Occasionally, producers provide data that is not in a form that can be processed automatically by the Dataverse Network. If you need to use data that is not available in your preferred format, please see Workshops & Tutorials, or contact us:
- Harvard users should contact us at dataquest@help.hmdc.harvard.edu.
- MIT users should contact the consultant using this email form.
Numeric Data Services
Numeric Data Services staff offer data reference by phone, via email, and through scheduled research consultations and reference hours. We also provide library instruction and data instruction. Our data group instruction classes are co-taught with HMDC statistical staff, and are general data courses taught throughout the academic year. Our course-related data classes are taught (upon request from faculty or teaching fellows) to provide an introduction to library data resources. We also are available to teach more general introductory library sessions for Government or Economics related courses.
In-Person Reference
Numeric Data Services staff provide data reference assistance at Research Services in Lamont Library (see web site for hours).
Email a Reference Question
Numeric Data Services staff will answer quick reference questions from our primary user groups (Harvard University students, staff, and faculty) via our email reference queue. This queue is also monitored by our statistical support staff, who can assist users with questions about statistical software and analysis. To contact us via email, send your questions to dataquest@help.hmdc.harvard.edu.
Research Consultations
Staff at Numeric Data Services also are available to meet with students, research assistants, teaching fellows, and faculty for in-depth research consultations. We can help you to identify relevant statistical resources, locate data files, and find journal literature of interest for your research, term paper, senior thesis, or dissertation. Please email dataquest@help.hmdc.harvard.edu to request an appointment.
Data Group Instruction
Staff at the Numeric Data Services and the Harvard-MIT Data Center (HMDC) offer a variety of data classes throughout the academic year. We encourage new data users to enroll in both the Introduction to Numeric Data Resources (see description below) and the Introduction to Stata classes. Classes are held in the Computer Training Lab, CGIS Knafel Building, Concourse level. Pre-registration is required by emailing Diane Sredl, Data Reference Librarian. See Workshop Schedule and Handouts for the schedule.
Introduction to Numeric Data Resources
Learn strategies for locating numeric data for term papers, senior theses, dissertations or other research purposes. Taught by a Data Librarian from Numeric Data Services (Harvard College Library), these courses cover everything from quick look-up sources to micro-level datasets in the social sciences, including those found in the IQSS Dataverse Network repository. There also is time for hands-on practice using Harvard e-resources in Economics, Government/Political Science, Sociology and Health.
Classes are held in the Computer Training Lab, CGIS North, Concourse level. Pre-registration is required by emailing Diane Sredl, Data Reference Librarian.
Course-Related Instruction
We also offer instruction sessions tailored to individual courses. We have provided library instruction sessions, which might include a data component, for courses in Government, Economics, Public Health, Sociology, Religion, and Freshman Writing. If you have any questions or would like to schedule an instruction session, please contact Diane Sredl, Data Reference Librarian.
Other Consulting
Other data and statistical consultants are available to assist users in specific areas.
Numeric Data Services
Reference assistance is available in person at Research Services in Lamont Library (see Hours). Please contact Diane Sredl, Data Reference Librarian, via email or at 617-496-6936 to schedule an appointment.
Center for Geographic Analysis
The CGA Help Desk is available at both the Cambridge and Longwood campuses on Tuesdays from 1:30 to 4:00 PM, and at other times by appointment. Contact them at contact@help.cga.harvard.edu.
Software Distributions - Overview
We distribute several software packages that are made available by agreement with the software makers. These packages are not free, but they are generally available at a greatly reduced cost to Harvard community members. See the individual software packages below for details and cost information.
MIT community members are not eligible for any of the specific distribution plans at this time.
For more information, please contact us.
EViews
Through our EViews departmental license, we offer IQSS affiliates the ability to purchase EViews software and manuals at a reduced cost. Please note that this offer is not available to the general Harvard community due to restrictions placed by the software maker.
For details about the cost or questions about eligibility, please contact us.
JMP
We renewed the JMP site license for Harvard for FY09 (ending July 14, 2009). Currently we have the following software licensed:
- JMP 8.0.0 for Windows
- JMP 7.0.2 for Windows
- JMP 7.0.2 for Macintosh
- JMP 7.0.2 for Linux
We also have licenses for JMP 7.0.0 and 6 if you do not choose to upgrade.
Fees
We charge a small fee for users because the site license costs us more money than just a departmental license. Here are the FY09 fees:
- Single User (one platform) - $15
- Single User (multiple platforms) - $20
- Small Group, Lab...no more than 10 installations (multiple) - $50
- Department Wide (multiple) - $200
- School (multiple) - $500
- Medical Area Folks - See Below
You receive a CD for installing the version for the platform of your choice. Print manuals are not included, but PDFs are part of the installation packages.
Fees must be paid by check (made out to Harvard University), Crimson Cash, or 33-digit billing code (including group and financial contact with code). You must pick up your software at:
Concourse Computer Lab Help Desk
CGIS Knafel Building, Room K026
Phone number x6-9365
Our hours are:
- Monday through Thursday, 8 AM to 9 PM
- Friday, 8 AM to 6 PM
- Saturday and Sunday, 12 PM to 5 PM
Please let us know if you are coming ahead of time by emailing us at jmp_support@help.hmdc.harvard.edu.
Medical Area Personnel
The Research Information Technology Group (part of HMS IT) graciously has bought into the license for the medical community and is making the software available from their downloads site:
http://wiki.med.harvard.edu/Software
Note: You must have a valid e-Commons login to get the software from this site.
If you have questions or problems with this site, please use the form at the following URL:
http://ritg.med.harvard.edu/support/
Stata
We volunteer to function as the Harvard distributor of Stata's Grad Plan. The Concourse Computer Lab Help Desk in CGIS Knafel Building, room K026, is the pickup point for all Harvard Grad Plan orders.
What is the Grad Plan? Simply put, Stata sends stock for all their GradPlan offerings to a campus volunteer. When you order GradPlan software and manuals from Stata's web site, you pick them up from the campus volunteer instead of having Stata ship to you directly. Since the materials are already on campus, turnaround is fast, usually within 36 hours.
Fees
A comparison of Perpetual License costs for the GradPlan vs University Information Systems is:
- Stata 10 Intercooled - $155 vs $178
- Stata 10 SE - $335 vs $385
- GradDoc Se - $179 vs $206
- User Guide - $35 vs $40
Also available through the Grad Plan are student-only, single-year licenses of Small Stata and Intercooled Stata 10. Stata also makes Stat/Transfer from Circle Systems available through their Grad Plan.
For complete information on the GradPlan visit the following page:
http://www.stata.com/order/new/edu/gradplan.html
Note: All ordering and sales inquiries are handled by Stata. We volunteer to distribute the GradPlan for the Harvard campus. We cannot answer purchasing questions, but Stata has responsive and helpful staff available by email or phone to answer your questions and take your orders.
To order via the GradPlan, go to the following page:
http://www.stata.com/order/schoollist.html
Stata Pick Up
If you plan to pick up the software at our office, choose Harvard University as your school and proceed through the order. An email is sent to you confirming the order. After Stata processes your order, we receive an email with your ordering information and relevant licensing codes (if applicable). This process usually takes up to 36 hours. We then contact you and let you know that we received your information and when you can pick up your order.
If you need the software shipped directly to yourself, choose Harvard University - Extension School as your school and proceed through the order. You are given shipping options. Remember, the turnaround for this option is longer and you pay for shipping.
Affiliates Without Harvard Email Addresses
Sales staff at Stata are generally helpful to those with a clear affiliation with Harvard. If you do not have a Harvard email address, first try the web order form. If that is rejected, try to order Stata by phone and explain your affiliation. If that does not work, contact us.
What do the file extensions mean?
Sometimes the filename extension identifies a dataset. This is not completely reliable, as the same extension can be used for multiple data types, some datasets are not labelled, and different operating systems use different labels for the same data. Some common extensions are as follows:
- .txt - ASCII data or documentation
- .dat - Usually ASCII, but can be a variety of formats, including EBCDIC
- .xls - Microsoft Excel Spreadsheet
- .dbf - dBase II, III, or IV format
- .por - SPSS portable format
- .sav - SPSS PC binary (also SST binary)
- .ssp - SAS transport file
- .ssd/.sd2 - SAS for pc/windows
- .sda, .dta - Stata
- .ebc, _ebcdic - The EBCDIC binary format
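Because one extension can point to more than one format, a lookup should return every candidate rather than a single answer. The following is a minimal Python sketch of that idea. The table simply restates the list above, and the function name candidate_formats is made up for illustration.

```python
from pathlib import Path

# Candidate formats per extension, restating the list above.
# Every entry is a list because an extension can be ambiguous.
CANDIDATE_FORMATS = {
    ".txt": ["ASCII data", "documentation"],
    ".dat": ["ASCII data", "EBCDIC data", "other"],
    ".xls": ["Microsoft Excel spreadsheet"],
    ".dbf": ["dBase II, III, or IV"],
    ".por": ["SPSS portable"],
    ".sav": ["SPSS binary", "SST binary"],
    ".ssp": ["SAS transport"],
    ".ssd": ["SAS for PC/Windows"],
    ".sd2": ["SAS for PC/Windows"],
    ".sda": ["Stata"],
    ".dta": ["Stata"],
    ".ebc": ["EBCDIC"],
}


def candidate_formats(filename: str) -> list[str]:
    """Return the formats a file name might indicate (empty if unknown)."""
    name = filename.lower()
    if name.endswith("_ebcdic"):
        return ["EBCDIC"]
    return CANDIDATE_FORMATS.get(Path(name).suffix, [])


print(candidate_formats("survey.SAV"))    # two candidates: try SPSS first
print(candidate_formats("study_ebcdic"))
```

An empty or multi-item result means the extension alone is not enough, and the codebook or study documentation has to settle the question.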
What problems could occur in data conversion?
- File extensions - Filename extensions can be ambiguous. You must properly identify the real format of the file for translation to work.
- Variable types - Some data formats do not support all variable types, so information may be lost in the conversion.
- Variable value formats - Some data formats do not support all variable formats, so information may be lost in the conversion.
- Missing values - Some data formats do not have any way of specifying missing values. DBMSCopy typically translates missing values as blanks if the target data format does not have a code for missing values.
- Variable names - Some data formats disallow particular characters in variable names, or limit the length of names. DBMSCopy truncates variable names that are too long for the target format, and usually replaces illegal characters with the underscore character.
- Size limitations - Some data formats limit the number of variables or records in a dataset, and DBMSCopy truncates datasets that exceed these limits. DBMSCopy also might run out of memory trying to process large datasets. If this happens, close all other applications, or expand virtual memory, and run DBMSCopy again.
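Truncating variable names can silently make two different variables share one name. The Python sketch below imitates the behaviour described above (replace illegal characters with an underscore, then truncate) so that you can spot such collisions before converting. The 8-character limit and the allowed character set are assumptions chosen for illustration, not the rules of any particular format.

```python
import re


def convert_names(names, max_len=8):
    """Replace illegal characters with '_' and truncate to max_len.

    Returns (new_names, collisions), where each collision is a tuple
    (first_original, later_original, shared_new_name).
    """
    new_names = []
    seen = {}
    collisions = []
    for original in names:
        cleaned = re.sub(r"[^A-Za-z0-9_]", "_", original)[:max_len]
        if cleaned in seen:
            collisions.append((seen[cleaned], original, cleaned))
        else:
            seen[cleaned] = original
        new_names.append(cleaned)
    return new_names, collisions


names = ["household_income_1990", "household_income_1991", "age-group"]
converted, clashes = convert_names(names, max_len=8)
print(converted)
print(clashes)
```

If the collision list is not empty, renaming the affected variables in the source file before conversion avoids ambiguity afterwards.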
How do I access a codebook?
- ICPSR codebooks - Many ICPSR codebooks are available electronically in PDF format along with the data files, which are accessible via the HMDC Virtual Data Center. Printed codebooks are available at Littauer Library (Harvard) and Dewey Library (MIT) and these are searchable via the Libraries' catalogs. The codebook collection circulates for the same period as regular books.
- MIT codebooks - Harvard graduate students and faculty may borrow codebooks from MIT libraries and vice-versa under the standard Harvard-MIT reciprocal library agreement. You will need to obtain a borrower's card first. Please talk with the library staff at your institution. The standard library agreement does not cover graduate students during the summer, or undergraduate students at any time. As a service, the Data Center has negotiated a special arrangement for these users. Littauer Library at Harvard and Dewey Library at MIT have been supplied with inter-library loan forms for foreign patrons, and will provide an instant inter-library loan of codebooks to these users who are not covered by the standard reciprocal agreement.
- Roper codebooks - Some Roper codebooks are available electronically with the data files, and these are accessible via the IQSS Dataverse Network. Littauer Library holds a collection of printed Roper codebooks for use with corresponding data files. These codebooks are searchable in the HOLLIS catalog and circulate to Harvard users for the same period as regular books. The codebooks are located in the Reading Room, next to the newspapers.
How do I associate an SPSS or SAS control card with the raw data file?
To do this in SPSS:
- Download the ASCII data file and the accompanying SPSS control card from the web.
- Uncompress the files if necessary.
- Start SPSS.
- Open the control card. Select the File menu, choose Open, and then choose SPSS Syntax.
- Near the top of the control card, you should see a statement like this:
DATA LIST FILE='data871.txt'.
Change the file name to the name of your data file. Use the full path name, such as:
DATA LIST FILE='C:\temp\da6044.txt'.
- At the very bottom of the control card you should see the line execute. If not, add it.
- Save your changes to the file.
- Select the Run menu All option to execute the control card.
- Some control cards may describe the data structure without actually loading the data.
In this case select the Transform menu Run Pending Transforms option in the SPSS Data Editor.
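If you have many control cards to adjust, the two manual edits described above (pointing DATA LIST at your data file and making sure the card ends with execute) can be scripted. The following Python sketch is only an illustration: the function name is made up, and it assumes the card contains a single quoted file name after DATA LIST FILE=.

```python
import re


def point_control_card_at(syntax: str, data_path: str) -> str:
    """Replace the file name in the first DATA LIST FILE='...' statement
    and append an execute command if the card does not already have one."""
    pattern = re.compile(r"(DATA\s+LIST\s+FILE\s*=\s*)(['\"]).*?\2", re.IGNORECASE)
    # A function is used as the replacement so that backslashes in
    # Windows paths are inserted literally.
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}'{data_path}'", syntax, count=1
    )
    if count == 0:
        raise ValueError("no DATA LIST FILE= statement found")
    if not re.search(r"^\s*execute\s*\.\s*$", updated, re.IGNORECASE | re.MULTILINE):
        updated = updated.rstrip() + "\nexecute.\n"
    return updated


card = "DATA LIST FILE='data871.txt'\n /V1 1-3 V2 4-5.\n"
print(point_control_card_at(card, r"C:\temp\da6044.txt"))
```

Read the control card into a string, pass it through the function, and save the result under a new name so that the downloaded original stays untouched.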
How do I create a data definition statement with DBMSCopy?
- Launch DBMSCopy for Windows.
- Select the Interactive menu Copy Database option.
- Use the dialog box to specify the location and name of the file, and its file type. Be careful to specify the correct file type, as the file name and extension might be ambiguous. If the data were created by a spreadsheet, the variable names and data may be located anywhere in that spreadsheet. DBMSCopy might ask you to specify the data area.
- You can now choose to view a sample of the file, as DBMSCopy interprets it, to modify/filter the transformation, or to transform the data automatically. Select one (for this example, use the last option).
- Use the dialog box to specify the location, name and type of the output data.
- Load the new data in your favorite statistical program and check it to make sure that there are no obvious errors, missing variables, or missing records.
How do I download a subset of data using the IQSS Dataverse Network?
How do I extract a data subset using DBMSCopy?
- Launch DBMSCopy for Windows.
- Select the Interactives menu Copy Database option.
- DBMSCopy will display a dialog box labeled Power Panel. You can use this control to select fields and records.
- Select the Power Panel menu Variable Information option. Then, use the keep checkboxes to keep or drop variables from the list.
- Note: If you convert data from an ASCII file, you can drop variables simply by leaving them undefined.
- Select the Power Panel menu Record Filter and Equations option. Then, build a filtering expression to select the records that you want to keep. The built-in help explains these expressions in detail. Following is a simple example of an expression that extracts only records for Alabama in 1990:
select upper(STATENAME) = "ALABAMA" and YEAR =1990;
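For readers without DBMSCopy, the same kind of subset (keep some variables, keep only the Alabama records for 1990) can be sketched in Python for a tab-delimited file with a header row. The file layout, the POP variable, and the sample values are made up for illustration.

```python
import csv
import io


def extract_subset(source, keep, predicate):
    """Yield dicts holding only the `keep` columns for rows passing `predicate`."""
    for row in csv.DictReader(source, delimiter="\t"):
        if predicate(row):
            yield {name: row[name] for name in keep}


# Tiny in-memory stand-in for a tab-delimited data file (made-up values).
sample = io.StringIO(
    "STATENAME\tYEAR\tPOP\n"
    "Alabama\t1990\t100\n"
    "Alabama\t1980\t90\n"
    "Alaska\t1990\t50\n"
)

rows = list(
    extract_subset(
        sample,
        keep=["STATENAME", "POP"],
        predicate=lambda r: r["STATENAME"].upper() == "ALABAMA"
        and int(r["YEAR"]) == 1990,
    )
)
print(rows)
```

To run this against a real file, replace the StringIO object with an open file handle, for example open("data.tab", newline="").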
How do I read in hierarchical data?
The following example shows hierarchical data:
COUNTRY USA 1967
PERSON 1223 312 3132 3132 3132 3321 31232 3132
PERSON 1223 312 3132 3132 3132 3321 31232 3132
PERSON 1223 312 3132 3132 3132 3321 31232 3132
COUNTRY UK 1968
Each of the person records implicitly belongs to the country that precedes it, but different types of records have different variables and layouts. SAS can handle layouts of this complexity, as can SPSS with programming. Stata can merge these with some effort. DBMSCopy and Stat/Transfer cannot use these at all.
One approach is to use a text manipulation language like Perl to split the file by record type (appending an identifier to each record so that you can continue to match people to countries):
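#!/usr/bin/perl
use strict;
use warnings;

# Position of the record-type field (adjust to match your codebook).
my $RECIDSTART = 0;
my $RECIDLEN   = 7;

my $country_count = 0;
my $country_field = '';
open(my $countryf, '>', 'country.out') or die "country.out: $!";
open(my $personf,  '>', 'person.out')  or die "person.out: $!";
while (<>) {
    chomp;
    # Record-type labels as they appear in the sample above.
    (my $recid = substr($_, $RECIDSTART, $RECIDLEN)) =~ s/\s+$//;
    if ($recid eq 'COUNTRY') {
        $country_count++;
        # Right-justify the running country number in a 10-character field.
        $country_field = sprintf('%10d', $country_count);
        print $countryf $_ . $country_field . "\n";
    } elsif ($recid eq 'PERSON') {
        print $personf $_ . $country_field . "\n";
    }
}
close($countryf);
close($personf);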
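The same splitting idea can be sketched in Python. This version uses the COUNTRY and PERSON labels from the sample data, reads the record type from the first whitespace-separated token (an assumption; a real file may need fixed column positions from the codebook), and returns two lists instead of writing files.

```python
def split_hierarchy(lines):
    """Split COUNTRY/PERSON records into two lists, tagging each record
    with a running country number so persons can be matched to countries."""
    countries, persons = [], []
    country_id = 0
    for line in lines:
        record = line.rstrip("\n")
        if not record.strip():
            continue  # skip blank lines
        record_type = record.split()[0]
        if record_type == "COUNTRY":
            country_id += 1
            countries.append(f"{record} {country_id:10d}")
        elif record_type == "PERSON":
            persons.append(f"{record} {country_id:10d}")
    return countries, persons


sample = [
    "COUNTRY USA 1967\n",
    "PERSON 1223 312 3132\n",
    "PERSON 1223 312 3132\n",
    "COUNTRY UK 1968\n",
]
countries, persons = split_hierarchy(sample)
print(countries)
print(persons)
```

Each output list can then be written to its own file and read as an ordinary rectangular dataset, with the appended number serving as the merge key.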
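Another approach is to flatten the file, combining all the levels of the hierarchy into one long record:
#!/usr/bin/perl
use strict;
use warnings;

# Position of the record-type field (adjust to match your codebook).
my $RECIDSTART = 0;
my $RECIDLEN   = 7;

my $country_rec = '';
while (<>) {
    (my $recid = substr($_, $RECIDSTART, $RECIDLEN)) =~ s/\s+$//;
    if ($recid eq 'COUNTRY') {
        chomp;
        $country_rec = $_;
    } elsif ($recid eq 'PERSON') {
        print $country_rec . $_;
    }
}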
The flattening approach can waste a lot of space. You also have to remember that you did this, and take care not to do things like sum country variables across persons, or you will overcount.
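The overcounting risk is easier to see in a small Python sketch of the flattening step. As before, the record type is read from the first token of each line, which is an assumption made for illustration.

```python
def flatten_hierarchy(lines):
    """Prefix every PERSON record with the most recent COUNTRY record."""
    flat = []
    country_record = None
    for line in lines:
        record = line.rstrip("\n")
        if not record.strip():
            continue  # skip blank lines
        record_type = record.split()[0]
        if record_type == "COUNTRY":
            country_record = record
        elif record_type == "PERSON":
            if country_record is None:
                raise ValueError("PERSON record appears before any COUNTRY record")
            flat.append(f"{country_record} {record}")
    return flat


sample = [
    "COUNTRY USA 1967\n",
    "PERSON 1223 312 3132\n",
    "PERSON 1223 312 3132\n",
    "PERSON 1223 312 3132\n",
    "COUNTRY UK 1968\n",
]
flat = flatten_hierarchy(sample)

# Country fields now repeat once per person, so count distinct countries
# rather than rows when a country-level total is needed.
distinct_countries = {tuple(row.split()[:3]) for row in flat}
print(len(flat), len(distinct_countries))
```

Note that a country with no person records (UK in this sample) disappears from the flattened output entirely, which is another reason to keep the original file.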
How do I read in Roper data?
In a few cases, the data are in SPSS portable format, in which case you can read them with SPSS or DBMSCopy. In most cases, however, Roper data are either in Card-Image format or in Column-Binary.
For Roper Card-Images, use DBMSCopy to read the data. The number of cards specified in the codebook is the number of lines per record to specify in DBMSCopy. If the codebook does not specify in which card a variable belongs, assume that the codebook starts with card 1, unless otherwise labelled, and assume that all variables are in the same card, until you see a new Card subtitle in the codebook.
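If you prefer to prepare card-image data yourself, the "number of lines per record" idea can be sketched in Python: every group of N consecutive lines (cards) is joined into one long record. The 80-column padding is an assumption based on the usual card width; check the codebook for the actual layout.

```python
def join_cards(lines, cards_per_record, width=80):
    """Concatenate each group of `cards_per_record` lines into one record.

    Each card is padded with blanks to `width` columns so that a variable
    on card k always starts at column (k - 1) * width of the joined record.
    """
    cards = [line.rstrip("\n").ljust(width) for line in lines]
    if len(cards) % cards_per_record != 0:
        raise ValueError("line count is not a multiple of cards_per_record")
    return [
        "".join(cards[i:i + cards_per_record])
        for i in range(0, len(cards), cards_per_record)
    ]


# Two respondents with two cards each (made-up content, width shortened).
records = join_cards(["A1\n", "A2\n", "B1\n", "B2\n"], cards_per_record=2, width=4)
print(records)
```

A line count that is not a multiple of the cards-per-record value usually means a card is missing or the codebook's card count was misread.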
For instructions on Column-Binary, see How do I use EBCDIC, Column-Binary, Packed-Decimal, or other data?
How do I use an SST data file?
SST was once far ahead of its time, but is now largely abandoned. Occasionally, however, you may need to convert an SST .sav file to some readable format.
Note: Most .sav files that you encounter are created by SPSS, using a completely different format. Try opening files with SPSS first. The SST conversion script on the utilities page will help you convert the .sav file to tab-delimited ASCII, which can be read by any program. Unfortunately, to run the script, or to convert an SST file using other means, you will need a copy of the SST program; no other program reads it. Check our supported software list for availability in our lab. More information on SST is available from the Berkeley stat lab.
How do I use EBCDIC, Column-Binary, Packed-Decimal, or other data?
Some older data sets may use the EBCDIC character set rather than the ASCII character set. However, there are many variants of this and related encodings, such as OSIRIS, column-binary, packed decimal, and zoned decimal. No single tool works automatically with all of these variants. If there is not complete documentation on the precise format used, expect some trial and error with different variants. There are a few options to convert EBCDIC to ASCII data:
- Converting OSIRIS - Stat/Transfer makes this easy, although it can also be done in SAS.
- Converting EBCDIC and Column Binary - The CCRW tool, which is part of the CCOUNT package, converts from these formats to ASCII. It is freely downloadable, although not supported.
- Converting simple EBCDIC on Linux - Converting the most common variant of EBCDIC to ASCII is easy to do from a Linux account. If you have an HMDC or FAS Linux account, log in using your username and password and try the following:
cat inputdata.ebc | dd conv=ascii > outputdata.txt
For inputdata.ebc use the name of the original EBCDIC file, and for outputdata.txt use the name you would like the resulting plain text file to have. This creates a file without any line breaks, since EBCDIC files were all fixed-record-length anyway (many used 80 characters as the record length). The actual record length should be in the codebook. Some statistical programs have trouble reading data without line breaks, so you can add them with another command:
cat outputdata.txt | fold -w RECORDLENGTH > outputdata_with_linebreaks.txt
For RECORDLENGTH use the record length from the codebook.
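The dd and fold steps can also be sketched in Python, which is convenient on systems without those tools. The sketch assumes the file contains only character data in the cp037 code page (one common EBCDIC variant shipped with Python). Other variants, and files with packed or binary fields, need different handling.

```python
def ebcdic_to_lines(raw: bytes, record_length: int, codec: str = "cp037") -> list[str]:
    """Decode EBCDIC bytes and cut the text into fixed-length records."""
    if len(raw) % record_length != 0:
        raise ValueError("file size is not a multiple of the record length")
    text = raw.decode(codec)
    return [text[i:i + record_length] for i in range(0, len(text), record_length)]


# Round-trip check on a made-up file holding two 8-character records.
raw = "USA 1967UK  1968".encode("cp037")
print(ebcdic_to_lines(raw, record_length=8))
```

For a real file, read the bytes with open(path, "rb").read() and write "\n".join(lines) to the output file. A size error is a useful early warning that the record length taken from the codebook does not match the file.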
Using SAS to Read EBCDIC, Zoned-Decimal, Column-Binary, and Other Data Formats
SAS can be used to read a wide variety of obscure data formats. However, it is usually not the simplest way to do so.
Reading data in SAS requires you to know the precise file and variable format. Most of the obscure formats are shown in the following table.
SAS Examples
In addition to reading using the correct informats, you may need to specify the correct recfm and lrecl parameters to infile. For column binary, use recfm=F lrecl=160.
Here's an example of reading a few variables in:
DATA dataset1;
/* replace 'path-name' with your file specification or path name */
infile 'path-name' recfm=F lrecl=160;
input @1 var1 $CB8. var2 CB2. var3 $CB5. var4 CB3. (var5 var6 var7) ($CB3.);
run;
For EBCDIC data, use recfm=N for most cases, or recfm=S370V, recfm=S370VB, or recfm=S370VBS for variable S370 record, variable S370 block record, and variable block spanned S370 record formats. The lrecl value should be set according to the record length given in the study documentation. Use lrecl=32768 if you need to guess. An example is:
DATA dataset2;
/* replace 'path-name' with your file specification or path name */
infile 'path-name' recfm=S370VB lrecl=32768;
input @1 var1 $EBCDIC1. var2 S370FF2. var3 $EBCDIC8. var4 $EBCDIC1. (var5 var6 var7) ($EBCDIC6.);
run;
Using SPSS for EBCDIC
Older versions of SPSS can read mainframe formats via options to the FILE HANDLE command (see FILE HANDLE and DATA LIST in the SPSS documentation). For example FILE HANDLE=MULTIPUNCH is used for column binary. However, this functionality is not in any recent versions for Windows, to our knowledge. So we recommend other tools.
Using PERL to Convert EBCDIC
The Perl programming language can also be used to convert EBCDIC on Windows, and will also handle more obscure variants. However, Perl is more difficult to install and use, and doesn't cover as many formats as SAS. Here are some useful links:
The Convert::IBM390 and Convert::EBCDIC Perl modules are available from public sources and can be used to convert packed-decimal, zoned-decimal, IBM EBCDIC, and other related formats. (You need Perl installed on your system.)
You may also use SAS for zoned and packed decimal and EBCDIC data. The SAS informats S370FPD, S370FZD, $VARYING, and $EBCDIC will convert packed decimal, zoned decimal, variable-length, and EBCDIC data, respectively. Be aware that there are many variations of the zoned, packed, and variable-length formats, and one must use a different conversion function for each variation.
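To make the packed and zoned formats less mysterious, here is a minimal Python sketch that decodes one common IBM variant of each. It assumes the sign lives in the last nibble (packed) or in the zone nibble of the last byte (zoned), with B or D meaning negative. Other variants exist, and any implied decimal places still have to be applied from the codebook.

```python
def unpack_packed_decimal(field: bytes) -> int:
    """Decode IBM packed decimal: two digits per byte, sign in the last nibble."""
    if not field:
        raise ValueError("empty field")
    digits = []
    for byte in field[:-1]:
        digits += [byte >> 4, byte & 0x0F]
    digits.append(field[-1] >> 4)
    sign_nibble = field[-1] & 0x0F
    if any(d > 9 for d in digits) or sign_nibble < 0x0A:
        raise ValueError("not a valid packed-decimal field")
    value = int("".join(map(str, digits)))
    return -value if sign_nibble in (0x0B, 0x0D) else value


def unpack_zoned_decimal(field: bytes) -> int:
    """Decode EBCDIC zoned decimal: one digit per byte, sign in the last zone."""
    if not field:
        raise ValueError("empty field")
    digits = [byte & 0x0F for byte in field]
    if any(d > 9 for d in digits):
        raise ValueError("not a valid zoned-decimal field")
    value = int("".join(map(str, digits)))
    return -value if (field[-1] >> 4) in (0x0B, 0x0D) else value


print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C])))
print(unpack_zoned_decimal(bytes([0xF1, 0xF2, 0xD3])))
```

These functions must be applied to the raw bytes of each field, before any EBCDIC-to-ASCII character conversion, because character conversion corrupts packed fields.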
How do I use DBMSCopy to convert data formats?
- Launch DBMSCopy for Windows.
- Select the Interactives menu Copy Database option.
- Use the dialog box to specify the location and name of the file, and its file type. Be careful to specify the correct file type - as the file name and extension may be ambiguous. If the data were created by a spreadsheet, the variable names and data may be located anywhere in that spreadsheet. DBMSCopy might ask you to specify the data area.
- You can now choose to view a sample of the file, as DBMSCopy interprets it, to modify or filter the transformation, or to automatically transform the data. Select one (for this example, use the last option).
- Use the dialog box to specify the location, name and type of the output data.
- Load the new data in your favorite statistical program and check it to make sure that there are no obvious errors, missing variables, or missing records.
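One way to make the final check less error-prone is to export the data before and after conversion to tab-delimited text and compare the two exports. The Python sketch below reports missing variables and a changed record count. It assumes both exports have a header row, and the function names are made up for illustration.

```python
import csv
import io


def shape(source, delimiter="\t"):
    """Return (record_count, variable_names) for a delimited file with a header."""
    reader = csv.reader(source, delimiter=delimiter)
    header = next(reader)
    records = sum(1 for _ in reader)
    return records, header


def compare_exports(before, after, delimiter="\t"):
    """Return a list of human-readable differences between two exports."""
    before_count, before_vars = shape(before, delimiter)
    after_count, after_vars = shape(after, delimiter)
    problems = []
    missing = [name for name in before_vars if name not in after_vars]
    if missing:
        problems.append(f"missing variables: {', '.join(missing)}")
    if before_count != after_count:
        problems.append(f"record count changed: {before_count} -> {after_count}")
    return problems


# Made-up exports: the converted file lost one variable and one record.
before = io.StringIO("ID\tAGE\tINCOME\n1\t30\t100\n2\t40\t200\n")
after = io.StringIO("ID\tAGE\n1\t30\n")
print(compare_exports(before, after))
```

Because variable names may be truncated or altered during conversion, a "missing" variable can also be a renamed one, so treat the report as a prompt for a closer look rather than a verdict.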
How do I use OSIRIS data?
The easiest way to convert OSIRIS data is with Stat/Transfer:
- Download the OSIRIS data file and the OSIRIS dictionary.
- Make sure that the dictionary file and the data file have the same name and the extensions .DICT and .DATA respectively (rename them if necessary). They should also be in the same directory.
- Run Stat/Transfer.
- Select Osiris as the Input Type.
- Enter the name of the dictionary file.
- Select the desired output format.
Alternatively, you can use SAS, with this sample code:
libname OUTPUT '/path/to/outdir';
filename DATA1 '/path/to/study.data';
filename DICTIONARY1 '/path/to/study.dict';
proc convert osiris=DATA1 dict=DICTIONARY1 out=OUTPUT.newname;
run;
exit;
Note: SAS expects both the dictionary and the data file to be in the original EBCDIC format. In many cases, ICPSR has converted the data file to ASCII and added line feeds. To convert it back to EBCDIC, first strip the line feeds (tr -d \\n < asciidata > asciistrip), then convert it back (dd conv=ebcdic < asciistrip > ebcdicdata). SAS is available in the data center computer lab. If you need a freeware tool to handle OSIRIS formats, you may wish to examine a discussion of such tools in IASSIST Quarterly (V20, N. 2).
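The two shell steps in the note (strip the line feeds, then convert to EBCDIC) can be sketched in Python as well. This sketch uses the cp037 code page, which is an assumption: the table used by dd conv=ebcdic may differ from cp037 for some punctuation characters, so compare results if your data contain more than letters, digits, and blanks.

```python
def ascii_to_ebcdic(text: str, codec: str = "cp037") -> bytes:
    """Strip line breaks, then encode the remaining text as EBCDIC."""
    stripped = text.replace("\r", "").replace("\n", "")
    return stripped.encode(codec)


# Two made-up 2-character records separated by line feeds.
converted = ascii_to_ebcdic("AB\n12\n")
print(converted.hex())
```

Write the returned bytes with open(path, "wb") so that no newline translation is applied to the output file.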
How do I load a very large dataset in STATA?
STATA can handle relatively large datasets with ease. However, if your dataset is truly large, standard STATA will be unable to load it, and you will need a different version called STATA SE. Note that standard STATA handles up to 2047 variables and matrices as large as 800 x 800. STATA SE can handle datasets with as many as 32,767 variables and matrices as large as 11,000 x 11,000.
If your dataset appears to be small enough for STATA to open, yet you are unable to do so, try increasing the memory STATA allocates for its functions. At startup, type set mem 1024m and then try loading your dataset. Most lab computers will allow you to set the memory this high. If your computer is unable to do so, try a lower number.
Alternatively, you can tell STATA to read only selected rows, row ranges, and/or a subset of the variables in the dataset, saving memory:
use VAR1 VAR2 VAR3 if VAR1==2 in 1/1000 using DATAFILE
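Before choosing a value for set mem, it can help to estimate how much space the data themselves need: roughly the number of observations times the total width of the variables. The Python sketch below uses the standard Stata storage-type widths (byte 1, int 2, long 4, float 4, double 8, strN N bytes). It is only a lower bound, since it ignores any overhead STATA adds, and the function name is made up for illustration.

```python
# Width in bytes of each numeric Stata storage type.
STORAGE_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def estimate_megabytes(n_obs: int, var_types: list[str]) -> float:
    """Rough size of the data area: observations times bytes per observation."""
    width = 0
    for storage_type in var_types:
        if storage_type.startswith("str"):
            width += int(storage_type[3:])  # strN occupies N bytes
        else:
            width += STORAGE_BYTES[storage_type]
    return n_obs * width / (1024 ** 2)


# One million observations of a hypothetical four-variable dataset.
print(round(estimate_megabytes(1_000_000, ["long", "float", "double", "str20"]), 1))
```

If the estimate is well above the memory you can allocate, reading only the variables and rows you need, as in the use command above, is usually the more practical route.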