


Instant Data Intensive Apps with Pandas How-to (PacktPub 2013) Trent Hauck.rar
In this recipe we'll introduce the pandas DataFrame by doing some quick exercises, then move on to one of the most fundamental parts of data analysis: getting data in and out of files.
Getting ready
Most of the rest of the book is about working with data once it's in a pandas data structure, but this recipe is about those structures themselves and getting data in and out of them. Open your interpreter, preferably IPython.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Most of the file input and output in pandas is the orchestration behind the scenes of formatting the value outputs, and then writing those values to a file. There are many options for formatting file output. The to_csv method takes many parameters. Some of the more common parameters are as follows:
The following snippet takes the DataFrame df and writes it to a file called file.tsv, formatted according to the parameters passed to the method.
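The original listing is not reproduced here, so the following is only a minimal sketch of such a call. The sample DataFrame, the temporary output directory, and the particular parameters chosen (sep, index, float_format) are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the DataFrame `df` from the text is not shown.
df = pd.DataFrame({"one": [1, 2, 3], "two": [4.5, 5.5, 6.5]})

# Write into a temporary directory so the example leaves nothing behind.
out_path = os.path.join(tempfile.mkdtemp(), "file.tsv")

df.to_csv(
    out_path,
    sep="\t",             # tab-separated rather than comma-separated
    index=False,          # leave the row index out of the file
    float_format="%.2f",  # write floats with two decimal places
)

# Read the raw text back to inspect what was written.
with open(out_path) as fh:
    written = fh.read()
print(written)
```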
In addition to standard file input and output functionalities, pandas has several built-in niceties.
Parsing dates at file read time
Using pandas' sophisticated date parser, a CSV file can be read and its dates parsed at the same time, as shown in the following command:
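As a sketch of what such a command might look like, the example below reads an in-memory CSV and asks read_csv to parse one column as dates. The column names and values are made up, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file path.
csv_text = "date,value\n2012-01-01,10\n2012-01-02,20\n"

# parse_dates lists the columns that should be converted to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(parsed.dtypes)
```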
Besides the parsing capabilities, pandas also has a very handy date_range function, which returns a range of dates determined by the inputs. For example, it's very easy to get the months of 2012 in a series. This is shown in the following command line:
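One possible way to get the months of 2012 is sketched below. It assumes the month-start alias "MS", which gives the first day of each month; the original command may well have used a different frequency alias.

```python
import pandas as pd

# Twelve month-start dates beginning in January 2012.
# "MS" (month start) is an assumption; the original alias is not shown.
months_2012 = pd.Series(pd.date_range("2012-01-01", periods=12, freq="MS"))

print(months_2012)
```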
pandas can also read CSV data from the Web, assuming http://www.example.com/data.csv is the URL. Take a look at the following example:
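A minimal sketch follows. read_csv accepts a URL string in the same position as a file path, so the helper below (a hypothetical name, load_csv) simply forwards its argument. The Web call is left commented out because the URL in the text is a placeholder, and an in-memory buffer is used as an offline stand-in.

```python
import io

import pandas as pd

DATA_URL = "http://www.example.com/data.csv"  # placeholder URL from the text


def load_csv(source):
    """Read CSV data from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Reading from the Web would look like this (not run here, since the
# placeholder URL does not serve real data):
# web_df = load_csv(DATA_URL)

# Offline stand-in that goes through exactly the same call:
sample = load_csv(io.StringIO("a,b\n1,2\n3,4\n"))
print(sample)
```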
In this recipe we'll walk through some basic functionality for slicing pandas objects. If you're familiar with array slicing, most of this will look familiar, with a few idiosyncrasies specific to pandas.
Getting ready
Open up your interpreter, and execute the following steps:
At some level, pandas objects behave similarly to NumPy arrays; they are, after all, abstractions built on top of them. However, because we have more metadata about the data structures, we can use that to our advantage.
After the initial pandas object is created, simple slicing occurs according to the following structure:
Here, column names is a string (or an array of strings, for multiple columns), and rows is the number of rows that we wish to use.
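The slicing structure itself is missing from this copy of the text. The sketch below shows one common reading of it, selecting columns first and then rows. The DataFrame and its column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with four rows and three named columns.
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=["one", "two", "three"],
)

# A single column name gives a Series; [:2] keeps its first two rows.
first_rows_one = df["one"][:2]

# A list of column names gives a DataFrame; [:2] keeps its first two rows.
first_rows_two_cols = df[["one", "two"]][:2]

print(first_rows_one)
print(first_rows_two_cols)
```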
The methods that have already been described are very useful at a higher level, but there are more granular operations available.
Direct index access
The .ix indexer is an advanced method for selecting and slicing a DataFrame. Taking the sample from the preceding example, df.ix[1:3, ['one', 'two']] = 10 will not only select the specified subset of the data, but also set its values equal to 10. The .xs method has a more explicit interface for working with indexes.
Often, the index of the DataFrame falls out of alignment when slicing data. In pandas, the easiest way to reset an index is with the reset_index() method of the DataFrame object.
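Note that .ix was removed in pandas 1.0, so the statement above only runs on old releases. The sketch below uses .loc, the label-based replacement, on a made-up DataFrame. With the default integer index, the label slice 1:3 includes row 3.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a default integer index (0 to 4).
df = pd.DataFrame(
    np.arange(15).reshape(5, 3),
    columns=["one", "two", "three"],
)

# Label-based selection and assignment; both ends of 1:3 are included.
df.loc[1:3, ["one", "two"]] = 10

# .xs returns the cross-section for a single index label, here row 4.
last_row = df.xs(4)

print(df)
print(last_row)
```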
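As a small illustration with made-up data, filtering rows leaves gaps in the index, and reset_index() renumbers it. Passing drop=True discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical data with the default index 0..4.
df = pd.DataFrame({"one": [10, 20, 30, 40, 50]})

# Filtering keeps the original labels, so the index now starts at 2.
subset = df[df["one"] > 20]

# Renumber from 0; drop=True throws the old index away.
renumbered = subset.reset_index(drop=True)

print(subset.index.tolist())
print(renumbered.index.tolist())
```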
Indexes are not advanced because they're difficult, but if we want to be an expert with pandas it is important that we use them well. We will discuss hierarchical indexes in the following There's more... section.
Getting ready
A good understanding of indexes in pandas is crucial for moving data around quickly. From a business intelligence perspective, they create a distinction similar to that of metrics and dimensions in an OLAP cube. To illustrate this point, this recipe walks through getting stock data out of pandas, combining it, then reindexing it for easy chomping.
The previous example was certainly contrived, but when indexing and statistical techniques are incorporated, the power of pandas begins to come through. Statistics will be covered in an upcoming recipe.
pandas' indexes by themselves can be thought of as descriptors of a certain point in the DataFrame. When ticker and timestamp are the only indexes in a DataFrame, then the point is individualized by the ticker, timestamp, and column name. After the point is individualized, it's more convenient for aggregation and analysis.
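To make this concrete, here is a sketch with invented tickers and prices (the stock data used in the recipe is not reproduced here). Once ticker and timestamp form the index, a single point is addressed by ticker, timestamp, and column name.

```python
import pandas as pd

# Invented sample data; tickers and values are purely illustrative.
raw = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "timestamp": pd.to_datetime(
        ["2012-01-03", "2012-01-04", "2012-01-03", "2012-01-04"]
    ),
    "close": [10.0, 10.5, 20.0, 19.5],
    "volume": [100, 150, 200, 250],
})

# Move ticker and timestamp into a hierarchical (two-level) index.
indexed = raw.set_index(["ticker", "timestamp"]).sort_index()

# One point is identified by (ticker, timestamp) plus a column name.
point = indexed.loc[("BBB", pd.Timestamp("2012-01-04")), "close"]

print(indexed)
print(point)
```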
Indexes show up all over the place in pandas so it's worthwhile to see some other use cases as well.
Advanced header indexes
Hierarchical indexing isn't limited to rows. Headers can also be represented by a MultiIndex, as shown in the following command:
As a prelude to the following sections, we'll run a single groupby operation here, since groupby works so well with indexes.
This answers the following question: for each ticker and for each day (not date), what was the mean volume over the life of the data?
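A minimal sketch of a hierarchical header is shown below. The level names and column labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Two-level column header: an outer group and an inner field name.
columns = pd.MultiIndex.from_tuples(
    [("price", "open"), ("price", "close"), ("size", "volume")],
    names=["group", "field"],
)

wide = pd.DataFrame(np.arange(6).reshape(2, 3), columns=columns)

# Selecting a top-level label returns every column beneath it.
prices = wide["price"]

print(wide)
print(prices)
```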
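The original groupby call is not shown in this copy, so the sketch below is one possible version. It uses invented data and assumes that "day" means day of the week (0 is Monday), grouping on values taken from the two index levels.

```python
import pandas as pd

# Invented data: 2012-01-02 and 2012-01-09 are Mondays, 2012-01-03 a Tuesday.
raw = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "timestamp": pd.to_datetime(
        ["2012-01-02", "2012-01-03", "2012-01-09", "2012-01-02", "2012-01-09"]
    ),
    "volume": [100, 150, 300, 200, 400],
})
indexed = raw.set_index(["ticker", "timestamp"])

# Grouping keys built from the index levels.
ticker_level = indexed.index.get_level_values("ticker")
weekday_level = (
    indexed.index.get_level_values("timestamp").dayofweek.rename("weekday")
)

# Mean volume per ticker and per day of the week (an assumed reading of "day").
mean_volume = indexed.groupby([ticker_level, weekday_level])["volume"].mean()

print(mean_volume)
```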
In this recipe we'll talk about working with dates in pandas. Because pandas was initially written for financial time series, it has a lot of out-of-the-box date functionality.
Getting ready
Open up your interpreter and follow the command progression in the following section. Difficult financial analysis was the mother of pandas' creation; therefore, it has many efficient and easy ways of dealing with dates.
The date_range function is defined by dates and frequencies. See the following section for the various frequency designations. The easiest way is to define a start date, end date, and frequency, but there are other ways as well. You can also change the frequency, or resample to a smaller or larger time interval.
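The calls below sketch the start, end, and frequency form described above. The dates are arbitrary, and the aliases used ("D" for calendar days, "B" for business days, "W" for weeks ending on Sunday) are just a few of the available designations.

```python
import pandas as pd

# Every calendar day between the two dates, inclusive.
daily = pd.date_range(start="2012-01-01", end="2012-01-10", freq="D")

# Only business days (Monday to Friday) in the same span.
business = pd.date_range(start="2012-01-01", end="2012-01-10", freq="B")

# A start date plus a number of periods instead of an end date.
weekly = pd.date_range(start="2012-01-01", periods=3, freq="W")

print(daily)
print(business)
print(weekly)
```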
pandas adds a lot more functionality for handling dates. These are mostly convenience methods, because working with dates is a necessary evil of data analysis.
Alternative date range specification
Time series in pandas don't have to be defined by a start and end date. In pandas, it is possible to represent the time of the Series as an interval of dates with a common period between data points. For example, if we want to create a Series for Y2K, we can do so as follows:
pandas offers the ability to move up and down the granularity of a time series. For example, given a Series of random numbers s for all the days in 2012, calculating the sum for each month is done with the following command:
In the preceding example, the 'M' argument specifies that we're downsampling to months. Upsampling is done in a similar way; because it disaggregates the data, pandas provides functionality for filling in the new, finer-grained values conveniently.
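One way to read this is sketched below: the index is defined by a start date, a number of periods, and a frequency rather than by an end date. Covering the whole of the year 2000 with daily data is an assumption about what the original example did, and the values are placeholders.

```python
import numpy as np
import pandas as pd

# 2000 was a leap year, so 366 daily periods cover it exactly.
index_2000 = pd.date_range(start="2000-01-01", periods=366, freq="D")

# Placeholder values; the original data is not shown.
y2k = pd.Series(np.arange(366), index=index_2000)

print(y2k.head())
print(y2k.tail())
```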
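A sketch of such a monthly sum, written against the current resample API, follows. It assumes the month-start alias "MS" rather than "M", because recent pandas releases deprecate the bare "M" alias for resampling; the only visible difference is that each bin is labeled with the first day of the month instead of the last.

```python
import numpy as np
import pandas as pd

# Random numbers for every day of 2012 (seeded so the run is repeatable).
days_2012 = pd.date_range("2012-01-01", "2012-12-31", freq="D")
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(len(days_2012)), index=days_2012)

# Group the daily values into calendar months and sum each group.
monthly_sum = s.resample("MS").sum()

print(monthly_sum)
```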
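Going the other way, from coarse to fine, leaves gaps that need a fill rule. The sketch below takes a made-up monthly Series to daily frequency and forward-fills each month's value; forward filling is only one of several possible choices.

```python
import pandas as pd

# Hypothetical monthly values labeled by the first day of each month.
monthly = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2012-01-01", periods=3, freq="MS"),
)

# Move to daily frequency; ffill() carries each value forward
# until the next known observation.
daily_ffill = monthly.resample("D").ffill()

print(daily_ffill.head())
print(len(daily_ffill))
```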
In this recipe we'll walk through the process of applying a function to a DataFrame. This is a simple but very important part of data analysis. Rarely, if ever, will data in its raw form be sufficient for data analysis. Often, that data needs to be transformed into some other form, and to do that you'll need to apply functions to pandas objects.
Getting ready
Open up your interpreter, and type the following commands successively.
pandas sits on top of NumPy; thus pandas takes advantage of the broadcasting capabilities inherent within NumPy. For example, execute the following script to see the differences in NumPy:
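The original script is missing from this copy. As a sketch of the kind of difference it likely showed, the example below contrasts a plain Python list with a NumPy array and a DataFrame under the same multiplication. The values are arbitrary.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# On a plain list, * repeats the list.
doubled_list = values * 2

# On a NumPy array, the scalar is broadcast to every element.
doubled_arr = np.array(values) * 2

# A DataFrame inherits the same elementwise behaviour.
df = pd.DataFrame({"one": values, "two": [10, 20, 30]})
doubled_df = df * 2

print(doubled_list)
print(doubled_arr)
print(doubled_df)
```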
Understanding the underlying NumPy structure is beyond our scope here, but is extremely helpful in the long run.
pandas also provides the apply function, which can be used in place of an explicit for loop. Quite often it's necessary to perform more complex operations on entire columns of a DataFrame, where broadcasting or looping won't cut it.
Other apply options
There are other functions in the apply family. For example, the applymap function operates in a slightly different manner than the apply function. The applymap function operates on a single value and returns a single value, whereas the apply function takes an array-like data structure as an input.
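A minimal sketch of apply follows, using a made-up DataFrame and a hypothetical helper named value_range. By default the function receives one column at a time; with axis=1 it receives one row at a time.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [10, 20, 40]})


def value_range(column):
    """Spread between the largest and smallest value in a column."""
    return column.max() - column.min()


# Default axis=0: the function is called once per column.
col_ranges = df.apply(value_range)

# axis=1: the function is called once per row.
row_totals = df.apply(lambda row: row["one"] + row["two"], axis=1)

print(col_ranges)
print(row_totals)
```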
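The contrast can be sketched as follows. Since pandas 2.1 the elementwise method is also available as DataFrame.map and applymap is deprecated, so the example picks whichever the installed version provides. The data and the square helper are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})


def square(x):
    """Receives one scalar value and returns one scalar value."""
    return x ** 2


# Elementwise: prefer DataFrame.map where it exists, else applymap.
if hasattr(df, "map"):
    squared = df.map(square)
else:
    squared = df.applymap(square)

# apply, by contrast, hands the function a whole column at a time.
col_sums = df.apply(lambda column: column.sum())

print(squared)
print(col_sums)
```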
Functions can also be applied iteratively; however, this tends to make the functions slow and leads to unnecessarily verbose code.
Given that we have several different types of DataFrames, how can we best join them into one DataFrame for additional use? We'll also talk about merging and appending them in the following There's more... section.
Getting ready
Open up your interpreter and type the following commands successively. Very rarely will an analyst receive data in a single flat file. Quite often, data will need to be either appended to the bottom of the DataFrame or attached to the side. For example, if a set of data comes directly from a normalized database, the analyst will need to combine the tables by joining them on their Primary and Foreign Keys.
If you are familiar with R, you will see that joining data in pandas is not much different from joining data in R. We'll cover more on indexes later, but thinking of the default index as a Primary Key, or the combination of hierarchical indexes as a Composite Key, elucidates the joining process.
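The key-based join described here can be sketched with two invented tables, where customer_id plays the role of the Primary Key in one table and the Foreign Key in the other.

```python
import pandas as pd

# Invented "normalized" tables.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 1, 3],
    "amount": [25.0, 40.0, 15.0],
})

# Attach the customer name to each order by matching on the key column.
joined = pd.merge(orders, customers, on="customer_id")

print(joined)
```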
There are many options that can be supplied to the merge and join methods to modify the DataFrames' behaviour.
Merge and join details
The merge (and join) method uses a how parameter, which is a string naming the type of database-style join to perform. The possible values are 'left', 'right', 'outer', and 'inner'.
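To see the effect of each value, the sketch below merges two small made-up DataFrames that share only some keys and counts the rows each join type produces.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Number of rows produced by each join type.
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["left", "right", "outer", "inner"]
}

print(row_counts)
```

Only "b" and "c" appear on both sides, so the inner join keeps two rows, while the outer join keeps all four distinct keys and fills the missing values with NaN.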
The join function (not the previously mentioned one) is easy to use to join DataFrames.
Use suffixes to disambiguate columns if the DataFrames have similar column names. join defaults to joining on indexes, but the on parameter can be used to specify a column. For example:
One way to join the datasets is to just stack them on top of each other. This is similar to a UNION ALL command in SQL, since duplicate rows are kept. Given two DataFrames, One and Two, a concatenation is done in the following way:
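The original example is not shown, so here is a sketch with made-up data. For DataFrame.join the suffix parameters are named lsuffix and rsuffix (merge uses a single suffixes tuple instead), and on names the column in the calling DataFrame that is matched against the other DataFrame's index.

```python
import pandas as pd

# Both frames have a column called "value", so suffixes are required.
left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

# Match left["key"] against right's index; join is a left join by default.
joined = left.join(right, on="key", lsuffix="_left", rsuffix="_right")

print(joined)
```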
A list can also be used. Although it will be awkward for two DataFrames, it makes much more sense in the event of 50 DataFrames.
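Both cases can be sketched with pd.concat, which takes a list of DataFrames. The frames below are invented; ignore_index=True renumbers the rows of the result, which is an optional choice.

```python
import pandas as pd

one = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
two = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# Stack two frames on top of each other.
stacked = pd.concat([one, two], ignore_index=True)

# The same call scales to a longer list, for example 50 one-row frames.
many = [pd.DataFrame({"a": [i], "b": [i * 10]}) for i in range(50)]
stacked_many = pd.concat(many, ignore_index=True)

print(stacked)
print(len(stacked_many))
```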
