R Data Import
Introduction
Reading data into a statistical system for analysis and exporting the results to some other
system for report writing can be frustrating tasks that can take far more time than the
statistical analysis itself, even though most readers will find the latter far more appealing.
This manual describes the import and export facilities available either in R itself or via
packages which are available from CRAN or elsewhere.
Unless otherwise stated, everything described in this manual is (at least in principle)
available on all platforms running R.
In general, statistical systems like R are not particularly well suited to manipulations
of large-scale data. Some other systems are better than R at this, and part of the thrust
of this manual is to suggest that rather than duplicating functionality in R we can make
another system do the work! (For example Therneau & Grambsch (2000) commented that
they preferred to do data manipulation in SAS and then use package survival in S for the
analysis.) Database manipulation systems are often very suitable for manipulating and
extracting data: several packages to interact with DBMSs are discussed here.
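As a minimal sketch of this division of labour (assuming the CRAN packages DBI and RSQLite are installed; the table and query here are purely illustrative), R can hand the subsetting and aggregation to the database and pull back only the result:

```r
## Let a DBMS do the manipulation; DBI provides the common interface,
## RSQLite an embedded back-end (both assumed installed from CRAN).
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")  # in-memory database
dbWriteTable(con, "mtcars", mtcars)              # push a data frame to the DBMS
## SQL does the grouping; only the small summary comes back into R:
res <- dbGetQuery(con,
  "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)
```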
There are packages to allow functionality developed in languages such as Java, Perl and
Python to be directly integrated with R code, making the use of facilities in these languages
even more appropriate. (See the rJava package from CRAN and the SJava, RSPerl and
RSPython packages from the Omegahat project.)
It is also worth remembering that R, like S, comes from the Unix tradition of small reusable
tools, and it can be rewarding to use tools such as awk and perl to manipulate
data before import or after export. The case study in Becker, Chambers & Wilks (1988,
Chapter 9) is an example of this, where Unix tools were used to check and manipulate the
data before input to S. The traditional Unix tools are now much more widely available,
including for Windows.
Imports
The easiest form of data to import into R is a simple text file, and this will often be
acceptable for problems of small or medium scale. The primary function to import from
a text file is scan, and this underlies most of the more convenient functions discussed in
Chapter 2 [Spreadsheet-like data], page 7.
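As a minimal sketch (the file name and layout here are hypothetical), a small whitespace-separated text file with a header line can be read with read.table, which is built on top of scan:

```r
## Hypothetical file 'weights.txt' containing, say:
##   id height weight
##   1  172    68.5
##   2  181    79.0
d <- read.table("weights.txt", header = TRUE)
str(d)  # a data frame with columns id, height and weight

## The same file read at a lower level with scan, skipping the header;
## 'what' gives a template for the type of each field:
x <- scan("weights.txt", skip = 1,
          what = list(id = 0, height = 0, weight = 0))
```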
However, all statistical consultants are familiar with being presented by a client with a
memory stick (formerly, a floppy disc or CD-R) of data in some proprietary binary format,
for example ‘an Excel spreadsheet’ or ‘an SPSS file’. Often the simplest thing to do is to
use the originating application to export the data as a text file (and statistical consultants
will have copies of the most common applications on their computers for that purpose).
However, this is not always possible, and Chapter 3 [Importing from other statistical systems],
page 14 discusses what facilities are available to access such files directly from R.
For Excel spreadsheets, the available methods are summarized in Chapter 8 [Reading Excel
spreadsheets], page 30. For ODS spreadsheets from Open Office, see the Omegahat package
ROpenOffice.
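As one illustration of the methods summarized in that chapter (assuming the CRAN package gdata and a Perl installation are available; the file name is hypothetical):

```r
## read.xls in package gdata translates the sheet to CSV via a Perl
## script and then calls read.csv, so Perl must be installed.
library(gdata)
d <- read.xls("results.xls", sheet = 1)
```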
Encodings
Unless the file to be imported from is entirely in ASCII, it is usually necessary to know how
it was encoded. For text files, a good way to find out something about its structure is the
file command-line tool (for Windows, included in Rtools). This reports something like
text.Rd: UTF-8 Unicode English text
text2.dat: ISO-8859 English text
text3.dat: Little-endian UTF-16 Unicode English character data,
with CRLF line terminators
intro.dat: UTF-8 Unicode text
intro.dat: UTF-8 Unicode (with BOM) text
Modern Unix-alike systems, including Mac OS X, are likely to produce UTF-8 files. Windows
may produce what it calls ‘Unicode’ files (UCS-2LE or just possibly UTF-16LE). Otherwise
most files will be in an 8-bit encoding unless they come from a Chinese/Japanese/Korean
locale (which have a wide range of encodings in common use). It is not possible to detect
automatically and with certainty which 8-bit encoding was used (although guesses may be
possible, and file may guess as it did in the example above), so you may simply have to
ask the originator for some clues (e.g. ‘Russian on Windows’).
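Once the encoding is known (or guessed), it can be declared when importing; for example, for a hypothetical Latin-1 file:

```r
## Declare the encoding of the input; R re-encodes as it reads.
d <- read.table("data_latin1.txt", header = TRUE,
                fileEncoding = "latin1")

## At a lower level, a connection can carry the encoding:
con <- file("data_latin1.txt", encoding = "latin1")
lines <- readLines(con)
close(con)

## iconv converts strings already read in another encoding:
utf8_lines <- iconv(lines, from = "latin1", to = "UTF-8")
```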
‘BOMs’ (Byte Order Marks) can
cause problems for Unicode files. In the Unix world BOMs are rarely used, whereas in the
Windows world they almost always are for UCS-2/UTF-16 files, and often are for UTF-8
files. The file utility will not even recognize UCS-2 files without a BOM, but many other
utilities will refuse to read files with a BOM and the IANA standards for UTF-16LE and
UTF-16BE prohibit it. We have too often been reduced to looking at the file with the
command-line utility od or a hex editor to work out its encoding.
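The same check can be done from within R itself (the file name here is hypothetical): reading the first few bytes as raw data shows whether a BOM is present.

```r
## Look for a BOM in the first bytes of a file:
## EF BB BF marks UTF-8; FF FE marks UTF-16LE/UCS-2LE; FE FF, UTF-16BE.
first <- readBin("mystery.dat", what = "raw", n = 3)
print(first)
identical(first, as.raw(c(0xef, 0xbb, 0xbf)))  # TRUE for a UTF-8 BOM
```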
Spreadsheet-like data
In Section 1.2 [Export to text files], page 3 we saw a number of variations on the format of
a spreadsheet-like text file, in which the data are presented in a rectangular grid, possibly
with row and column labels. In this section we consider importing such files into R.
Variations on read.table
The function read.table is the most convenient way to read in a rectangular grid of data.
Because of the many possibilities, there are several other functions that call read.table
but change a group of default arguments.
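For instance, read.csv, read.csv2, read.delim and read.delim2 are such wrappers, differing only in their defaults for the field separator and the decimal point (the file names below are hypothetical):

```r
## These two calls are equivalent ways to read a comma-separated file
## with a header line:
d1 <- read.csv("values.csv")
d2 <- read.table("values.csv", header = TRUE, sep = ",")

## read.csv2 (and read.delim2) assume the Continental European
## convention: ';' as the separator and ',' as the decimal point.
d3 <- read.csv2("values_de.csv")
```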
Beware that read.table is an inefficient way to read in very large numerical matrices:
see scan below.
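As a sketch of that advice (dimensions and file name hypothetical), telling scan the type of the data up front avoids the overhead of read.table's per-column type determination and conversion:

```r
## Read a large numeric matrix (here 1000 rows x 50 columns) efficiently:
## scan returns one numeric vector, reshaped here row by row.
m <- matrix(scan("big_matrix.dat", what = double()),
            nrow = 1000, ncol = 50, byrow = TRUE)
```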