Basic SELECT Operations

Topics

  • Displaying the entire dataset
  • Selecting specific columns from a dataset
  • Calculations
  • Concatenations
  • Adding new columns
  • Renaming existing columns
  • Retreiving distinct list of values
  • Creating a simple DataFrame from a Python list
  • Getting info on your DataFrame

We start by importing the Pandas library

In [3]:
import pandas as pd

Next we use the read_csv method in order to read the contents of a csv file into a variable, in this example we're calling the variable products.
The ../../data/products.csv path is simply where products.csv is kept.

In [9]:
products=pd.read_csv("../../data/products.csv")

Selecting the entire dataset

In order to display the entire DataFrame contents, simply type its name

In [10]:
products
Out[10]:
ProductID ProductName UnitPrice CategoryID
0 1 Chai 18.00 1
1 2 Chang 19.00 1
2 3 Aniseed Syrup 10.00 2
3 4 Chef Anton's Cajun Seasoning 22.00 2
4 5 Chef Anton's Gumbo Mix 21.35 2
5 6 Grandma's Boysenberry Spread 25.00 2
6 7 Uncle Bob's Organic Dried Pears 30.00 7
7 8 Northwoods Cranberry Sauce 40.00 2
8 9 Mishi Kobe Niku 97.00 6
9 10 Ikura 31.00 8
10 11 Queso Cabrales 21.00 4
11 12 Queso Manchego La Pastora 38.00 4
12 13 Konbu 6.00 8
13 14 Tofu 23.25 7
14 15 Genen Shouyu 15.50 2

Combine it with :

  • head() to display the first 5 rows
  • tail() to display the last 5 rows
In [11]:
products.head()
Out[11]:
ProductID ProductName UnitPrice CategoryID
0 1 Chai 18.00 1
1 2 Chang 19.00 1
2 3 Aniseed Syrup 10.00 2
3 4 Chef Anton's Cajun Seasoning 22.00 2
4 5 Chef Anton's Gumbo Mix 21.35 2
In [12]:
products.tail()
Out[12]:
ProductID ProductName UnitPrice CategoryID
10 11 Queso Cabrales 21.00 4
11 12 Queso Manchego La Pastora 38.00 4
12 13 Konbu 6.00 8
13 14 Tofu 23.25 7
14 15 Genen Shouyu 15.50 2

You may also specify a number within the brackets to control how many rows will be displayed

In [6]:
products.head(3)
Out[6]:
ProductID ProductName UnitPrice CategoryID
0 1 Chai 18.0 1
1 2 Chang 19.0 1
2 3 Aniseed Syrup 10.0 2
In [13]:
products.tail(2)
Out[13]:
ProductID ProductName UnitPrice CategoryID
13 14 Tofu 23.25 7
14 15 Genen Shouyu 15.50 2

Selecting specific columns

  • Instead of selecting the entire dataset, in order to specify a list of columns, indicate their names within a square brackets
  • Column names are case sensitive
In [7]:
products["ProductName"].head()
Out[7]:
0                            Chai
1                           Chang
2                   Aniseed Syrup
3    Chef Anton's Cajun Seasoning
4          Chef Anton's Gumbo Mix
Name: ProductName, dtype: object

Specifying only one column returns a Series datatype, you can change that by using the to_frame() method

In [8]:
products["ProductName"].to_frame().head()
Out[8]:
ProductName
0 Chai
1 Chang
2 Aniseed Syrup
3 Chef Anton's Cajun Seasoning
4 Chef Anton's Gumbo Mix

Specifying multiple columns on the other hand, returns a dataframe.

In [9]:
products[["ProductID", "ProductName"]].head()
Out[9]:
ProductID ProductName
0 1 Chai
1 2 Chang
2 3 Aniseed Syrup
3 4 Chef Anton's Cajun Seasoning
4 5 Chef Anton's Gumbo Mix

Please note - when you're specifying more than one column, you need to use double square brackets.
The inner square brackets is a Python list allowing us to specify multiple values

Calculations

You may not always want to retrieve the data as is. In some cases, you may want to display your data with certain calculation, or look at a what-if scenarios (for example, you may want to know how the prices would look like after adding VAT).

Adding Values

In [10]:
products["UnitPrice"].add(5).head()
Out[10]:
0    23.00
1    24.00
2    15.00
3    27.00
4    26.35
Name: UnitPrice, dtype: float64
In [11]:
(products["UnitPrice"] + 5).head()
Out[11]:
0    23.00
1    24.00
2    15.00
3    27.00
4    26.35
Name: UnitPrice, dtype: float64

Substracting values

In [12]:
products["UnitPrice"].sub(5).head()
Out[12]:
0    13.00
1    14.00
2     5.00
3    17.00
4    16.35
Name: UnitPrice, dtype: float64
In [13]:
(products["UnitPrice"] - 5 ).head()
Out[13]:
0    13.00
1    14.00
2     5.00
3    17.00
4    16.35
Name: UnitPrice, dtype: float64

Dividing values

In [14]:
products["UnitPrice"].div(10).head()
Out[14]:
0    1.800
1    1.900
2    1.000
3    2.200
4    2.135
Name: UnitPrice, dtype: float64
In [15]:
(products["UnitPrice"] / 10 ).head()
Out[15]:
0    1.800
1    1.900
2    1.000
3    2.200
4    2.135
Name: UnitPrice, dtype: float64

Multiplying

In [16]:
products["UnitPrice"].mul(1.17).head()
Out[16]:
0    21.0600
1    22.2300
2    11.7000
3    25.7400
4    24.9795
Name: UnitPrice, dtype: float64
In [17]:
(products["UnitPrice"] * 1.17).head()
Out[17]:
0    21.0600
1    22.2300
2    11.7000
3    25.7400
4    24.9795
Name: UnitPrice, dtype: float64

Concatenations

  • To concatenate strings, simple use the "+" sign
  • Remember - Specifying only one column returns a Series datatype, you can always change that by using the to_frame() method
In [18]:
(products["ProductName"] + " - some string ").head()
Out[18]:
0                            Chai - some string 
1                           Chang - some string 
2                   Aniseed Syrup - some string 
3    Chef Anton's Cajun Seasoning - some string 
4          Chef Anton's Gumbo Mix - some string 
Name: ProductName, dtype: object

To concatenate string and numeric/data, convert the numeric/data to a string using the .astype(str) method

In [19]:
(products["ProductName"] + products["UnitPrice"].astype(str)).head()
Out[19]:
0                            Chai18.0
1                           Chang19.0
2                   Aniseed Syrup10.0
3    Chef Anton's Cajun Seasoning22.0
4         Chef Anton's Gumbo Mix21.35
dtype: object

Another example :

In [20]:
(products["ProductName"] + " - " + products["UnitPrice"].astype(str)).to_frame().head()
Out[20]:
0
0 Chai - 18.0
1 Chang - 19.0
2 Aniseed Syrup - 10.0
3 Chef Anton's Cajun Seasoning - 22.0
4 Chef Anton's Gumbo Mix - 21.35

Adding New Columns

Method #1

  • One way is to simply specify the new column, and assign it a value.
  • Using the method will add a new column into the dataframe
In [21]:
products["UnitPriceVAT"] = products["UnitPrice"].mul(1.17)
products.head()
Out[21]:
ProductID ProductName UnitPrice CategoryID UnitPriceVAT
0 1 Chai 18.00 1 21.0600
1 2 Chang 19.00 1 22.2300
2 3 Aniseed Syrup 10.00 2 11.7000
3 4 Chef Anton's Cajun Seasoning 22.00 2 25.7400
4 5 Chef Anton's Gumbo Mix 21.35 2 24.9795

Method #2

  • Another way is using the insert() command
  • Using the insert() allows you to determine where the new column will be located
  • And again, using the method will add a new column into the dataframe
In [22]:
products.insert(loc = 2, column = "UnitPriceDiscount", value = products["UnitPrice"] * 0.9)
products.head()
Out[22]:
ProductID ProductName UnitPriceDiscount UnitPrice CategoryID UnitPriceVAT
0 1 Chai 16.200 18.00 1 21.0600
1 2 Chang 17.100 19.00 1 22.2300
2 3 Aniseed Syrup 9.000 10.00 2 11.7000
3 4 Chef Anton's Cajun Seasoning 19.800 22.00 2 25.7400
4 5 Chef Anton's Gumbo Mix 19.215 21.35 2 24.9795

Another example :

In [23]:
products.insert(loc = 0, 
                column = "NewPriceDescriptor", 
                value = products["ProductName"] + " - " + products["UnitPrice"].astype(str))
products.head()
Out[23]:
NewPriceDescriptor ProductID ProductName UnitPriceDiscount UnitPrice CategoryID UnitPriceVAT
0 Chai - 18.0 1 Chai 16.200 18.00 1 21.0600
1 Chang - 19.0 2 Chang 17.100 19.00 1 22.2300
2 Aniseed Syrup - 10.0 3 Aniseed Syrup 9.000 10.00 2 11.7000
3 Chef Anton's Cajun Seasoning - 22.0 4 Chef Anton's Cajun Seasoning 19.800 22.00 2 25.7400
4 Chef Anton's Gumbo Mix - 21.35 5 Chef Anton's Gumbo Mix 19.215 21.35 2 24.9795

Method #3

  • Another way is to use the assign() method
  • This method assigns new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones
  • This method does not add a new column into the DataFrame, instead, assign() displays the new column only for the current execution
In [24]:
products[["ProductName", "UnitPrice"]].assign(PriceVAT999 = products["UnitPrice"].mul(1.17)).head()
Out[24]:
ProductName UnitPrice PriceVAT999
0 Chai 18.00 21.0600
1 Chang 19.00 22.2300
2 Aniseed Syrup 10.00 11.7000
3 Chef Anton's Cajun Seasoning 22.00 25.7400
4 Chef Anton's Gumbo Mix 21.35 24.9795

assign() may take multiple values :

In [25]:
products[["ProductName", "UnitPrice"]]\
.assign(PriceVAT = products["UnitPrice"].mul(1.17), 
        PriceDISCOUNT = products["UnitPrice"].mul(0.83)).head()
Out[25]:
ProductName UnitPrice PriceVAT PriceDISCOUNT
0 Chai 18.00 21.0600 14.9400
1 Chang 19.00 22.2300 15.7700
2 Aniseed Syrup 10.00 11.7000 8.3000
3 Chef Anton's Cajun Seasoning 22.00 25.7400 18.2600
4 Chef Anton's Gumbo Mix 21.35 24.9795 17.7205

Renaming Existing Columns

We'll start this demonstration by resetting the dataset :

In [15]:
products=pd.read_csv("../../data/products.csv")
products.head()
Out[15]:
ProductID ProductName UnitPrice CategoryID
0 1 Chai 18.00 1
1 2 Chang 19.00 1
2 3 Aniseed Syrup 10.00 2
3 4 Chef Anton's Cajun Seasoning 22.00 2
4 5 Chef Anton's Gumbo Mix 21.35 2

Method #1

We can use the columns attribute in order to rename all columns. Using this method, the change will be permanent

In [17]:
# products.columns 
products.columns = ["PID","PNAME","UPC","CID"]
products.head()
Out[17]:
PID PNAME UPC CID
0 1 Chai 18.00 1
1 2 Chang 19.00 1
2 3 Aniseed Syrup 10.00 2
3 4 Chef Anton's Cajun Seasoning 22.00 2
4 5 Chef Anton's Gumbo Mix 21.35 2

Method #2 :

Alternitavly we can use the rename() method. As long as we are not using inplace=True arguemt, the change will not be permanent

In [18]:
products.rename(columns = {"PNAME":"ProductName"}).head(2)
Out[18]:
PID ProductName UPC CID
0 1 Chai 18.0 1
1 2 Chang 19.0 1

You can see that the name of the second column remained PNAME

In [19]:
products.head(2)
Out[19]:
PID PNAME UPC CID
0 1 Chai 18.0 1
1 2 Chang 19.0 1

Using the inplace=True arguemt on the other hand, makes the change permanent

In [20]:
products.rename(columns = {"PNAME":"ProductName"}, inplace=True)
In [21]:
products.head()
Out[21]:
PID ProductName UPC CID
0 1 Chai 18.00 1
1 2 Chang 19.00 1
2 3 Aniseed Syrup 10.00 2
3 4 Chef Anton's Cajun Seasoning 22.00 2
4 5 Chef Anton's Gumbo Mix 21.35 2

DISTINCT

We'll start this demonstration by resetting the dataset

In [22]:
products=pd.read_csv("../../data/products.csv")

Note the values of products["CategoryID"] which contains many duplicate rows

In [23]:
products["CategoryID"].head(10)
Out[23]:
0    1
1    1
2    2
3    2
4    2
5    2
6    7
7    2
8    6
9    8
Name: CategoryID, dtype: int64

Method #1

To return a distinct list of value we can use the unique() method. This method may be operated only upon Series

In [24]:
products["CategoryID"].unique()
Out[24]:
array([1, 2, 7, 6, 8, 4], dtype=int64)

Method #2

To return a distinct list of values we can also use the drop_duplicates() method.

In [36]:
products["CategoryID"].drop_duplicates().to_frame().head()
Out[36]:
CategoryID
0 1
2 2
6 7
8 6
9 8

In contrast to unique(), drop_duplicates() may be operated upon multiple values as well

In [26]:
(products[["CategoryID","UnitPrice"]].drop_duplicates()).head()
Out[26]:
CategoryID UnitPrice
0 1 18.00
1 1 19.00
2 2 10.00
3 2 22.00
4 2 21.35

Creating a Simple DataFrame from Lists

  • There are multiple methods you can use in order to take a standard python data structure and use it for creating a panda’s DataFrame.
  • Here's one easy option :
In [38]:
sales = [['Jones LLC', 150, 200, 50],
         ['Alpha Co', 200, 210, 90],
         ['Blue Inc', 140, 215, 95]]
labels = ['account', 'Jan', 'Feb', 'Mar']
df = pd.DataFrame.from_records(sales, columns=labels)
df
Out[38]:
account Jan Feb Mar
0 Jones LLC 150 200 50
1 Alpha Co 200 210 90
2 Blue Inc 140 215 95

Getting info on your DataFrame

Shape

shape returns a tuple representing the dimensionality of the DataFrame (products has 15 rows and 4 columns).

In [28]:
products.shape
Out[28]:
(15, 4)

Data Types

dtypes returns the data type of each column in the DataFrame

In [29]:
products.dtypes
Out[29]:
ProductID        int64
ProductName     object
UnitPrice      float64
CategoryID       int64
dtype: object

General Information

info() prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage

In [30]:
products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
ProductID      15 non-null int64
ProductName    15 non-null object
UnitPrice      15 non-null float64
CategoryID     15 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 560.0+ bytes