Introduction to Pandas for data science

5 minutes

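Now that you've learned about NumPy, it's time to meet the other workhorse of data science in Python: Pandas. The Pandas library makes working with data (importing, cleaning, and organizing it) much easier. It's hard to imagine doing data science in Python without it.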

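But that wasn't always the case. Wes McKinney started developing the library in 2008, out of necessity, while he was at AQR Capital Management, to provide better tools for data analysis. As an open-source project, the library has since become a mature and integral part of the data science ecosystem. (In fact, some of the examples in this module are drawn from McKinney's book, Python for Data Analysis.)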

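The name Pandas has nothing to do with giant pandas; it comes from the term panel data. Panel data is a form of multidimensional data involving measurements over time, and the term originates in econometrics and statistics. Ironically, although Pandas once shipped a Panel data structure, it was rarely used and has been removed from recent releases, so we don't cover it here. Instead, we focus on the two most widely used data structures in Pandas: Series and DataFrames.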

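A reminder about imports and documentation

Just as you imported NumPy under the alias np, you import Pandas under the alias pd. First make sure Pandas is installed (run pip install pandas in a terminal).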

import pandas as pd

As with the NumPy convention, pd is a well-established and widely used alias in data science. We recommend using it in your own code.
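
As a quick sanity check after installation, the import can be followed by a version printout. This is only a minimal sketch; the version string you see depends on your own installation.

```python
import pandas as pd

# If the import above succeeds, Pandas is installed in the active environment.
# __version__ is a plain string such as "1.5.3"; the exact value will vary.
installed_version = pd.__version__
print("Pandas version:", installed_version)
```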

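As you work through this module, remember that IPython provides tab completion and function documentation (with the ? character). If anything about a function in this module is unclear, take a moment to read its documentation; it's very useful. As a reminder, to display the built-in Pandas documentation, use the following code: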

pd?
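
The ? syntax only works in IPython and Jupyter. In a plain Python interpreter, roughly the same text can be reached through docstrings. The sketch below assumes nothing beyond the standard library's inspect module; the choice of pd.Series.head is just an illustrative example.

```python
import inspect

import pandas as pd

# The module docstring is the text that `pd?` displays under "Docstring".
module_doc = pd.__doc__
print(module_doc[:300])

# inspect.getdoc() returns a cleaned-up docstring for any object.
# The built-in help(pd.Series.head) shows similar text interactively.
head_doc = inspect.getdoc(pd.Series.head)
print(head_doc)
```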

It helps to think of the Pandas Series and DataFrame as extensions of NumPy's ndarray, so go ahead and import NumPy as well. You'll need it for some of the examples later on:

import numpy as np
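
To make the relationship with ndarrays concrete, here is a minimal sketch. The values, labels, and variable names are made up for illustration. It shows a Series and a DataFrame built from NumPy arrays, and the underlying array recovered with to_numpy().

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional array of values plus an index of labels.
values = np.array([0.25, 0.5, 0.75, 1.0])
series = pd.Series(values, index=["a", "b", "c", "d"])

print(series["b"])        # label-based access
print(series.iloc[1])     # position-based access, as with an ndarray
print(series.to_numpy())  # the data as a plain ndarray

# NumPy universal functions work on a Series and preserve its index.
roots = np.sqrt(series)
print(roots)

# A DataFrame is a two-dimensional table with row and column labels.
frame = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["x", "y", "z"],
    columns=["first", "second"],
)
print(frame)
print(frame.to_numpy().shape)
```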

The pd? command shown earlier produces output like the following:

Type:        module
String form: 
File:        /opt/anaconda3/lib/python3.7/site-packages/pandas/__init__.py
Docstring:  
pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
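
The features listed in this docstring can be illustrated with a short sketch. All data and names below are invented for the example; it is meant as a rough tour, not a complete treatment of any feature.

```python
import numpy as np
import pandas as pd

# Missing data: NaN marks a missing value, and isna() locates it.
prices = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])
missing_count = int(prices.isna().sum())
print(missing_count)

# Automatic alignment: arithmetic matches values by label, not position.
# Labels present in only one operand produce NaN in the result.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = left + right
print(total)

# Size mutability: columns can be added to and dropped from a DataFrame.
frame = pd.DataFrame({"group": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
frame["doubled"] = frame["value"] * 2
frame = frame.drop(columns="doubled")

# Group by: split rows by a key, apply an aggregation, combine the results.
sums = frame.groupby("group")["value"].sum()
print(sums)
```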

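Learning objectives

In this module, you will:

Import the Pandas library into a Jupyter Notebook in Visual Studio Code
Learn how to use Series and DataFrames to store remote data
Learn how to clean and manipulate large, remote data sets
Manipulate Series and DataFrames for data science analysis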

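Prerequisites

Set up your environment

We recommend setting up your environment so that you can follow along and learn effectively in this module.

Complete the following steps to set up your environment:

Download and install Visual Studio Code. This tool is free and available for Windows, Mac, and Linux. Choose the stable build for your platform.
Download and install the Python extension for Visual Studio Code. The first step is to install a supported version of Python.
Activate an Anaconda environment so that you can run Jupyter Notebooks.
Set up a data science environment so that you can use NumPy and Pandas.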

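Test your environment

If you've successfully set up your environment with VS Code, Python, Anaconda, and the NumPy and Pandas libraries, you should be able to run a Jupyter Notebook inside VS Code.

Clone the Reactor repository and open the folder for this module in VS Code.
Run the Test-Setup-Config.ipynb file to make sure you're ready to continue with this module.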

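Complete this learning module

As you work through this module, you're encouraged to experiment with the code. Use the cloned files to do so.

A Jupyter Notebook is divided into cells. Each cell contains either text written in the Markdown markup language or space for writing and executing code. Because all code lives in code cells, you can run each code cell inline rather than using a separate Python interactive window.

Note

This learning module lets you run the code cells one at a time. As you work through the module, we recommend copying the code snippets into a Jupyter Notebook in VS Code and running one cell at a time.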
