01 Oct, 2019
By: Vishnu Prakash Singh
from IPython.display import Image
from datetime import date
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression's output in a cell, not just the last
Pandas Series
- A pandas Series is a one-dimensional labeled array capable of holding data of any type.
- The axis labels are collectively called the index.
- Labels need not be unique but must be of a hashable type.
import pandas as pd
import numpy as np  # needed for the np.random calls below

data = pd.Series(np.random.randint(10, 20, 6), index=pd.date_range('20130101', periods=6))
print('Type of the series is ' + str(type(data)))
Type of the series is <class 'pandas.core.series.Series'>
print('Index of the series is ' + str(data.index))
Index of the series is DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
print('Values of the series are ' + str(data.values))
Values of the series are [12 14 18 12 13 10]
print('Data Type of the series is ' + str(data.dtype))
Data Type of the series is int32
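As noted above, index labels need not be unique; a minimal illustrative sketch (not part of the original notebook):

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])  # the label 'a' repeats, which is allowed
s['a']  # a duplicated label returns all matching rows as a Series
s['b']  # a unique label returns the scalar 3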
Pandas DataFrame
np.random.seed(47)
df = pd.DataFrame({'ID': list(map(str, np.arange(103, 120, 3))),  # map applies str to every element; list() materializes it
                   'DATE': pd.date_range('20190101', periods=6),
                   'VALUE': np.round(np.random.uniform(1000, 5000, 6), 2),
                   'VOLUME': np.random.randint(1, 20, 6),
                   'SET': np.random.choice(['test', 'train'], 6)})
df.shape
df
(6, 5)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Checking the data type of all columns:

df.dtypes
ID object
DATE datetime64[ns]
VALUE float64
VOLUME int32
SET object
dtype: object
info() returns various details of the dataframe:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
ID 6 non-null object
DATE 6 non-null datetime64[ns]
VALUE 6 non-null float64
VOLUME 6 non-null int32
SET 6 non-null object
dtypes: datetime64[ns](1), float64(1), int32(1), object(2)
memory usage: 344.0+ bytes
describe() shows a quick statistic summary of your data:
df.describe() # only takes numeric columns under consideration
|       | VALUE       | VOLUME    |
|-------|-------------|-----------|
| count | 6.000000    | 6.000000  |
| mean  | 3450.255000 | 14.000000 |
| std   | 1272.159191 | 6.542171  |
| min   | 1453.950000 | 2.000000  |
| 25%   | 2762.007500 | 12.500000 |
| 50%   | 3872.680000 | 16.000000 |
| 75%   | 4127.550000 | 18.750000 |
| max   | 4897.930000 | 19.000000 |
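By default describe() summarizes only the numeric columns; pandas' standard include='all' option extends the summary to the remaining columns as well (a sketch, output omitted):

df.describe(include='all')  # also summarizes object, category and datetime columns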
Viewing pandas data

df.head(2)  # first two rows
df.tail(2)  # last two rows

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Display the index, columns:
df.index.to_list() # converting index to list for better printing
df.columns
[0, 1, 2, 3, 4, 5]
Index(['ID', 'DATE', 'VALUE', 'VOLUME', 'SET'], dtype='object')
pandas to numpy
Note: DataFrame.to_numpy() does not include the index or column labels in the output.

df.to_numpy()

array([['103', Timestamp('2019-01-01 00:00:00'), 1453.95, 14, 'train'],
       ['106', Timestamp('2019-01-02 00:00:00'), 4897.93, 18, 'train'],
       ['109', Timestamp('2019-01-03 00:00:00'), 3914.94, 19, 'test'],
       ['112', Timestamp('2019-01-04 00:00:00'), 2405.87, 19, 'test'],
       ['115', Timestamp('2019-01-05 00:00:00'), 3830.42, 12, 'train'],
       ['118', Timestamp('2019-01-06 00:00:00'), 4198.42, 2, 'test']],
      dtype=object)
Converting the data type of a variable

df.dtypes
df['VOLUME'] = df.VOLUME.astype(str)  # converting from integer to string
df.VOLUME.dtype
df['ID'] = df.ID.astype('category')  # converting from string to category (pandas' equivalent of R's factor)
df.ID.dtype
df['VOLUME'] = df.VOLUME.astype(int)  # converting from string back to integer
df.VOLUME.dtype
ID object
DATE datetime64[ns]
VALUE float64
VOLUME int32
SET object
dtype: object
dtype('O')
CategoricalDtype(categories=['103', '106', '109', '112', '115', '118'], ordered=False)
dtype('int32')
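A related conversion not covered above: date strings can be parsed into datetime64 with pd.to_datetime (a minimal illustrative sketch):

pd.to_datetime(pd.Series(['2019-01-01', '2019-01-02'])).dtype  # datetime64[ns]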
Transposing your data:
df.T.shape  # rows and columns are interchanged
df.T
(5, 6)
|        | 0                   | 1                   | 2                   | 3                   | 4                   | 5                   |
|--------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| ID     | 103                 | 106                 | 109                 | 112                 | 115                 | 118                 |
| DATE   | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 | 2019-01-03 00:00:00 | 2019-01-04 00:00:00 | 2019-01-05 00:00:00 | 2019-01-06 00:00:00 |
| VALUE  | 1453.95             | 4897.93             | 3914.94             | 2405.87             | 3830.42             | 4198.42             |
| VOLUME | 14                  | 18                  | 19                  | 19                  | 12                  | 2                   |
| SET    | train               | train               | test                | test                | train               | test                |
Sorting by an axis:
- axis=0 means a row-wise operation
- axis=1 means a column-wise operation (see the sketch below)
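Before sorting, a small illustrative sketch of the axis convention itself (not in the original notebook):

df[['VALUE', 'VOLUME']].sum(axis=0)  # aggregates down the rows: one total per column
df[['VALUE', 'VOLUME']].sum(axis=1)  # aggregates across the columns: one total per row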
# sorting by the row index
df.sort_index(axis=0, ascending=False)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
# sorting by the column index
df.sort_index(axis=1, ascending=False)
|   | VOLUME | VALUE   | SET   | ID  | DATE       |
|---|--------|---------|-------|-----|------------|
| 0 | 14     | 1453.95 | train | 103 | 2019-01-01 |
| 1 | 18     | 4897.93 | train | 106 | 2019-01-02 |
| 2 | 19     | 3914.94 | test  | 109 | 2019-01-03 |
| 3 | 19     | 2405.87 | test  | 112 | 2019-01-04 |
| 4 | 12     | 3830.42 | train | 115 | 2019-01-05 |
| 5 | 2      | 4198.42 | test  | 118 | 2019-01-06 |
# sorting by a column
df.sort_values(by='VOLUME', ascending=False)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Selection of a subset of data

# Selecting a single column yields a Series; equivalent to df.ID
df['ID'] # returns pandas.core.series.Series
0 103
1 106
2 109
3 112
4 115
5 118
Name: ID, dtype: category
Categories (6, object): [103, 106, 109, 112, 115, 118]
# Selecting via [], which slices the rows
df[0:3]
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
Selection by label using loc
df.loc[:,['VALUE']] # returns pandas.core.frame.DataFrame
|   | VALUE   |
|---|---------|
| 0 | 1453.95 |
| 1 | 4897.93 |
| 2 | 3914.94 |
| 3 | 2405.87 |
| 4 | 3830.42 |
| 5 | 4198.42 |
df.loc[0:1]  # label-based row slice; both endpoints are included

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
df.loc[1:2, ['ID', 'VALUE']]  # note: unlike positional slicing, the endpoint (label 2) is included
|   | ID  | VALUE   |
|---|-----|---------|
| 1 | 106 | 4897.93 |
| 2 | 109 | 3914.94 |
Selection by position using iloc

df.iloc[3]  # the fourth row, returned as a Series

ID                       112
DATE     2019-01-04 00:00:00
VALUE                2405.87
VOLUME                    19
SET                     test
Name: 3, dtype: object
df.iloc[[1, 2, 4], [0, 2]]
|   | ID  | VALUE   |
|---|-----|---------|
| 1 | 106 | 4897.93 |
| 2 | 109 | 3914.94 |
| 4 | 115 | 3830.42 |
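One difference between the two selection styles is worth spelling out (an illustrative sketch): loc slices by label and includes both endpoints, whereas iloc slices by position and excludes the stop index.

df.loc[1:2, ['ID', 'VALUE']]   # label slice: rows 1 AND 2
df.iloc[1:2, [0, 2]]           # position slice: row 1 only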
Boolean indexing

df[df.VALUE > 3000]  # e.g. keep rows where VALUE exceeds 3000
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
df[df.SET.isin(['train'])]  # equivalent to df[df.SET == 'train']
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
Missing data
df1 = df.copy()  # creates a copy of the dataframe
df1.loc[2, 'DATE'] = np.nan    # replacing a value with NA
df1.loc[2:3, 'VALUE'] = np.nan
df1.loc[5, 'SET'] = np.nan
# checking counts of missing values
df1.VALUE.isna().sum()
df1.isna().sum()
2
ID 0
DATE 1
VALUE 2
VOLUME 0
SET 1
dtype: int64
# To drop any row that has missing data:
df1.dropna(how='any')
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
# To drop rows that have missing data in a specific column:
df1.dropna(subset=['SET'], inplace=True)
# with inplace=True, the result is stored back in the same dataframe
df1
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | NaT        | NaN     | 19     | test  |
| 3 | 112 | 2019-01-04 | NaN     | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
Operations
df.mean()  # df.mean(axis=0); note the ID figure below is an artifact: the string IDs are summed by concatenation ('103106...118') and divided by 6
ID 1.718435e+16
VALUE 3.450255e+03
VOLUME 1.400000e+01
dtype: float64
df.mean(1)  # df.mean(axis=1); row-wise mean, computed over the numeric columns only
0 733.975
1 2457.965
2 1966.970
3 1212.435
4 1921.210
5 2100.210
dtype: float64
Apply
Any built-in or user-defined function can be applied to the columns of a DataFrame.
max_val_vol = df[['VALUE', 'VOLUME']].apply(max)
max_val_vol
min_val_vol = df[['VALUE', 'VOLUME']].apply(min)
min_val_vol
diff_max_min = max_val_vol-min_val_vol
diff_max_min
VALUE 4897.93
VOLUME 19.00
dtype: float64
VALUE 1453.95
VOLUME 2.00
dtype: float64
VALUE 3443.98
VOLUME 17.00
dtype: float64
The above operation can also be done using a lambda function in Python.
Syntax: lambda arguments: expression
df[['VALUE', 'VOLUME']].apply(lambda x: x.max() - x.min())
VALUE 3443.98
VOLUME 17.00
dtype: float64
Merge / Join
np.random.seed(42)
sales = pd.DataFrame({'order_id': np.random.randint(100, 105, 7),
                      'item': ['apple', 'guava', 'grape', 'grape', 'apple', 'guava', 'papaya'],
                      'date': ['2013-01-03', '2013-01-01', '2013-01-02', '2013-01-01', '2013-01-03', '2013-01-03', '2013-01-02']})
sales
|   | order_id | item   | date       |
|---|----------|--------|------------|
| 0 | 103      | apple  | 2013-01-03 |
| 1 | 104      | guava  | 2013-01-01 |
| 2 | 102      | grape  | 2013-01-02 |
| 3 | 104      | grape  | 2013-01-01 |
| 4 | 104      | apple  | 2013-01-03 |
| 5 | 101      | guava  | 2013-01-03 |
| 6 | 102      | papaya | 2013-01-02 |
cost = pd.DataFrame({'item': ['apple', 'guava', 'grape'],
                     'cost/kg': [200, 90, 170]})
cost
|   | item  | cost/kg |
|---|-------|---------|
| 0 | apple | 200     |
| 1 | guava | 90      |
| 2 | grape | 170     |
Left Join
sales_df = sales.merge(cost, on='item', how='left')
sales_df
|   | order_id | item   | date       | cost/kg |
|---|----------|--------|------------|---------|
| 0 | 103      | apple  | 2013-01-03 | 200.0   |
| 1 | 104      | guava  | 2013-01-01 | 90.0    |
| 2 | 102      | grape  | 2013-01-02 | 170.0   |
| 3 | 104      | grape  | 2013-01-01 | 170.0   |
| 4 | 104      | apple  | 2013-01-03 | 200.0   |
| 5 | 101      | guava  | 2013-01-03 | 90.0    |
| 6 | 102      | papaya | 2013-01-02 | NaN     |
Right Join
- sales.merge(cost, on='item', how='right')
Inner Join
- sales.merge(cost, on='item', how='inner')
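Since 'papaya' has no entry in cost, it is dropped by both of the joins above; a quick sketch of what each call returns (outputs omitted):

inner_df = sales.merge(cost, on='item', how='inner')  # 6 rows: only items present in both frames
right_df = sales.merge(cost, on='item', how='right')  # keeps every cost item; identical to the inner join for this data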
Row binding two dataframes

pd.concat([sales, cost], axis=0)
C:\Users\392256\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
  """Entry point for launching an IPython kernel.
|   | cost/kg | date       | item   | order_id |
|---|---------|------------|--------|----------|
| 0 | NaN     | 2013-01-03 | apple  | 103.0    |
| 1 | NaN     | 2013-01-01 | guava  | 104.0    |
| 2 | NaN     | 2013-01-02 | grape  | 102.0    |
| 3 | NaN     | 2013-01-01 | grape  | 104.0    |
| 4 | NaN     | 2013-01-03 | apple  | 104.0    |
| 5 | NaN     | 2013-01-03 | guava  | 101.0    |
| 6 | NaN     | 2013-01-02 | papaya | 102.0    |
| 0 | 200.0   | NaN        | apple  | NaN      |
| 1 | 90.0    | NaN        | guava  | NaN      |
| 2 | 170.0   | NaN        | grape  | NaN      |
Column binding two dataframes
pd.concat([sales, cost], axis=1)
|   | order_id | item   | date       | item  | cost/kg |
|---|----------|--------|------------|-------|---------|
| 0 | 103      | apple  | 2013-01-03 | apple | 200.0   |
| 1 | 104      | guava  | 2013-01-01 | guava | 90.0    |
| 2 | 102      | grape  | 2013-01-02 | grape | 170.0   |
| 3 | 104      | grape  | 2013-01-01 | NaN   | NaN     |
| 4 | 104      | apple  | 2013-01-03 | NaN   | NaN     |
| 5 | 101      | guava  | 2013-01-03 | NaN   | NaN     |
| 6 | 102      | papaya | 2013-01-02 | NaN   | NaN     |
Group by
# replacing the NAs in the above dataframe with the column mean
sales_df = sales_df.fillna(sales_df.mean())
sales_df
|   | order_id | item   | date       | cost/kg    |
|---|----------|--------|------------|------------|
| 0 | 103      | apple  | 2013-01-03 | 200.000000 |
| 1 | 104      | guava  | 2013-01-01 | 90.000000  |
| 2 | 102      | grape  | 2013-01-02 | 170.000000 |
| 3 | 104      | grape  | 2013-01-01 | 170.000000 |
| 4 | 104      | apple  | 2013-01-03 | 200.000000 |
| 5 | 101      | guava  | 2013-01-03 | 90.000000  |
| 6 | 102      | papaya | 2013-01-02 | 153.333333 |
Calculating the order-id-wise mean cost per kg:
mean_cost = sales_df.groupby('order_id')['cost/kg'].mean()
mean_cost
order_id
101 90.000000
102 161.666667
103 200.000000
104 153.333333
Name: cost/kg, dtype: float64
Adding a column using the map function
- Objective: add a column 'mean_cost_per_kg' containing the order-id-wise average cost per kg.
sales_df['mean_cost_per_kg'] = sales_df.order_id.map(sales_df.groupby('order_id')['cost/kg'].mean().to_dict())
sales_df
|   | order_id | item   | date       | cost/kg    | mean_cost_per_kg |
|---|----------|--------|------------|------------|------------------|
| 0 | 103      | apple  | 2013-01-03 | 200.000000 | 200.000000       |
| 1 | 104      | guava  | 2013-01-01 | 90.000000  | 153.333333       |
| 2 | 102      | grape  | 2013-01-02 | 170.000000 | 161.666667       |
| 3 | 104      | grape  | 2013-01-01 | 170.000000 | 153.333333       |
| 4 | 104      | apple  | 2013-01-03 | 200.000000 | 153.333333       |
| 5 | 101      | guava  | 2013-01-03 | 90.000000  | 90.000000        |
| 6 | 102      | papaya | 2013-01-02 | 153.333333 | 161.666667       |
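The same column can be produced without building an intermediate dictionary by using groupby().transform, which broadcasts the group-level mean back onto every row (a minimal equivalent sketch):

sales_df['mean_cost_per_kg'] = sales_df.groupby('order_id')['cost/kg'].transform('mean')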
Getting data in/out
# writing data to an excel or csv file
#path = '/mapr/hdpbiu/_392256_vishnu/softwares_for_new_joinee' # /mapr/hdpbiu = /Biu_load for server
path = 'C:/Users/392256/Documents/Intro to Python'
sales_df.to_csv(f'{path}/sales_df.csv', index=False)
sales_df.to_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1', index=False)
#sales_df.to_parquet(f'{path}/sales_df', engine = 'auto',index = False)
Reading from a csv or excel file
from datetime import datetime
df = pd.read_csv(f'{path}/sales_df.csv', dtype = {'order_id' : str })
df.dtypes
order_id object
item object
date object
cost/kg float64
mean_cost_per_kg float64
dtype: object
- pd.read_excel for reading excel files
- pd.read_parquet for reading parquet files
- pd.read_clipboard for reading from the clipboard
- pd.read_pickle for reading pickle files, e.g. those in which ML models are stored
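For example, the excel file written above could be read back with (a sketch assuming the same path):

df_xl = pd.read_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1')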
THE END