Python Tutorials Part III: Pandas Tutorial

9 minute read

01 Oct, 2019
By: Vishnu Prakash Singh
# notebook setup: display every expression's output in a cell, not just the last one
from IPython.display import Image
from datetime import date
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Pandas Series

  • A Pandas Series is a one-dimensional labeled array capable of holding data of any type.
  • The axis labels are collectively called the index.
  • Labels need not be unique but must be of a hashable type (see the sketch below).
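For instance, a Series can be built from a plain list with custom, even repeated, labels. A minimal sketch (not part of the original notebook):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])
s['a']   # the repeated label 'a' returns both matching entries as a Series

The rest of this section builds a Series with a DatetimeIndex: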
import pandas as pd
import numpy as np   # np is used throughout this tutorial
data = pd.Series(np.random.randint(10, 20, 6), index=pd.date_range('20130101', periods=6))
print('Type of the series is ' + str(type(data)))

Type of the series is <class 'pandas.core.series.Series'>

print('Index of the series is ' + str(data.index))

Index of the series is DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')

print('Values of the series is ' + str(data.values))

Values of the series is [12 14 18 12 13 10]

print('Data Type of the series is ' + str(data.dtype))

Data Type of the series is int32
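Values in the Series can also be looked up by label or by position. A small sketch, assuming the data Series defined above:

data['2013-01-02']                # label-based lookup on the DatetimeIndex
data.iloc[0]                      # position-based lookup of the first value
data['2013-01-01':'2013-01-03']   # label slicing includes the endpoint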

Pandas DataFrame

np.random.seed(47)
df = pd.DataFrame({'ID': list(map(str, np.arange(103, 120, 3))),   # map applies the str function to each item
                   'DATE': pd.date_range('20190101', periods=6),
                   'VALUE': np.round(np.random.uniform(1000, 5000, 6), 2),
                   'VOLUME': np.random.randint(1, 20, 6),
                   'SET': np.random.choice(['test', 'train'], 6)})
df.shape
df

(6, 5)

    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
2  109 2019-01-03  3914.94      19   test
3  112 2019-01-04  2405.87      19   test
4  115 2019-01-05  3830.42      12  train
5  118 2019-01-06  4198.42       2   test
Checking the data type of all columns
df.dtypes
ID                object
DATE      datetime64[ns]
VALUE            float64
VOLUME             int32
SET               object
dtype: object
info() returns a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
ID        6 non-null object
DATE      6 non-null datetime64[ns]
VALUE     6 non-null float64
VOLUME    6 non-null int32
SET       6 non-null object
dtypes: datetime64[ns](1), float64(1), int32(1), object(2)
memory usage: 344.0+ bytes
describe() shows a quick statistical summary of your data:
df.describe()   # only numeric columns are taken into consideration
             VALUE     VOLUME
count     6.000000   6.000000
mean   3450.255000  14.000000
std    1272.159191   6.542171
min    1453.950000   2.000000
25%    2762.007500  12.500000
50%    3872.680000  16.000000
75%    4127.550000  18.750000
max    4897.930000  19.000000
Viewing pandas data
df.head(2)
df.tail(2)
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train

    ID       DATE    VALUE  VOLUME    SET
4  115 2019-01-05  3830.42      12  train
5  118 2019-01-06  4198.42       2   test
Display the index and columns:
df.index.to_list() # converting index to list for better printing
df.columns

[0, 1, 2, 3, 4, 5]

Index(['ID', 'DATE', 'VALUE', 'VOLUME', 'SET'], dtype='object')

Pandas to NumPy
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
df.to_numpy()

array([['103', Timestamp('2019-01-01 00:00:00'), 1453.95, 14, 'train'],
       ['106', Timestamp('2019-01-02 00:00:00'), 4897.93, 18, 'train'],
       ['109', Timestamp('2019-01-03 00:00:00'), 3914.94, 19, 'test'],
       ['112', Timestamp('2019-01-04 00:00:00'), 2405.87, 19, 'test'],
       ['115', Timestamp('2019-01-05 00:00:00'), 3830.42, 12, 'train'],
       ['118', Timestamp('2019-01-06 00:00:00'), 4198.42, 2, 'test']], dtype=object)

Converting the data type of a column
df.dtypes

df['VOLUME'] = df.VOLUME.astype(str)   # converting from integer to string
df.VOLUME.dtype

df['ID'] = df.ID.astype('category')    # converting from string to categorical (the pandas analogue of R's factor)
df.ID.dtype

df['VOLUME'] = df.VOLUME.astype(int)   # converting from string back to integer
df.VOLUME.dtype
ID                object
DATE      datetime64[ns]
VALUE            float64
VOLUME             int32
SET               object
dtype: object

dtype('O')

CategoricalDtype(categories=['103', '106', '109', '112', '115', '118'], ordered=False)

dtype('int32')

Transposing your data:
df.T.shape  # rows and columns are interchanged
df.T

(5, 6)

                          0                    1                    2                    3                    4                    5
ID                      103                  106                  109                  112                  115                  118
DATE    2019-01-01 00:00:00  2019-01-02 00:00:00  2019-01-03 00:00:00  2019-01-04 00:00:00  2019-01-05 00:00:00  2019-01-06 00:00:00
VALUE               1453.95              4897.93              3914.94              2405.87              3830.42              4198.42
VOLUME                   14                   18                   19                   19                   12                    2
SET                   train                train                 test                 test                train                 test
Sorting by an axis:
  • axis=0 means a row-wise operation
  • axis=1 means a column-wise operation
# sorting by the row index
df.sort_index(axis=0, ascending=False)
    ID       DATE    VALUE  VOLUME    SET
5  118 2019-01-06  4198.42       2   test
4  115 2019-01-05  3830.42      12  train
3  112 2019-01-04  2405.87      19   test
2  109 2019-01-03  3914.94      19   test
1  106 2019-01-02  4897.93      18  train
0  103 2019-01-01  1453.95      14  train
# sorting by the column index
df.sort_index(axis=1, ascending=False)
   VOLUME    VALUE    SET   ID       DATE
0      14  1453.95  train  103 2019-01-01
1      18  4897.93  train  106 2019-01-02
2      19  3914.94   test  109 2019-01-03
3      19  2405.87   test  112 2019-01-04
4      12  3830.42  train  115 2019-01-05
5       2  4198.42   test  118 2019-01-06
# sorting by a column's values
df.sort_values(by='VOLUME', ascending=False)
    ID       DATE    VALUE  VOLUME    SET
2  109 2019-01-03  3914.94      19   test
3  112 2019-01-04  2405.87      19   test
1  106 2019-01-02  4897.93      18  train
0  103 2019-01-01  1453.95      14  train
4  115 2019-01-05  3830.42      12  train
5  118 2019-01-06  4198.42       2   test
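sort_values also accepts a list of columns, with a per-column sort order. A quick sketch with the same df (output not shown):

# ties in VOLUME are broken by VALUE, ascending
df.sort_values(by=['VOLUME', 'VALUE'], ascending=[False, True])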

Selecting a subset of data

# Selecting a single column yields a Series; df['ID'] is equivalent to df.ID
df['ID']  # returns pandas.core.series.Series
0    103
1    106
2    109
3    112
4    115
5    118
Name: ID, dtype: category
Categories (6, object): [103, 106, 109, 112, 115, 118]
# Selecting with [] slices the rows
df[0:3]
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
2  109 2019-01-03  3914.94      19   test
Selection by label using loc
df.loc[:,['VALUE']] # returns pandas.core.frame.DataFrame
     VALUE
0  1453.95
1  4897.93
2  3914.94
3  2405.87
4  3830.42
5  4198.42
df.loc[0:1]
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
df.loc[1:2, ['ID', 'VALUE']]
    ID    VALUE
1  106  4897.93
2  109  3914.94
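Note that loc slices by label and includes the endpoint, while iloc (next) slices by position and excludes it. A small comparison sketch:

df.loc[0:2]    # rows labeled 0, 1 and 2 (endpoint included)
df.iloc[0:2]   # rows at positions 0 and 1 (endpoint excluded)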

Selection by position using iloc

df.iloc[3]
ID                        112
DATE      2019-01-04 00:00:00
VALUE                 2405.87
VOLUME                     19
SET                      test
Name: 3, dtype: object
df.iloc[[1, 2, 4], [0, 2]]
    ID    VALUE
1  106  4897.93
2  109  3914.94
4  115  3830.42

Boolean indexing

df[df.VALUE > 3500]
    ID       DATE    VALUE  VOLUME    SET
1  106 2019-01-02  4897.93      18  train
2  109 2019-01-03  3914.94      19   test
4  115 2019-01-05  3830.42      12  train
5  118 2019-01-06  4198.42       2   test
df[df.SET.isin(['train'])]  # equivalent to df[df.SET == 'train']
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
4  115 2019-01-05  3830.42      12  train
df[df.SET == 'train']
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
4  115 2019-01-05  3830.42      12  train
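Multiple conditions can be combined with & (and), | (or) and ~ (not); each condition must be wrapped in parentheses. A minimal sketch:

df[(df.SET == 'train') & (df.VALUE > 3500)]   # train rows with VALUE above 3500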

Missing data

df1 = df.copy()                # creates a copy of the dataframe
df1.loc[2, 'DATE'] = np.nan    # replacing values with NA
df1.loc[2:3, 'VALUE'] = np.nan
df1.loc[5, 'SET'] = np.nan
# checking counts of missing values
df1.VALUE.isna().sum()
df1.isna().sum()

2

ID        0
DATE      1
VALUE     2
VOLUME    0
SET       1
dtype: int64
# To drop any rows that have missing data
df1.dropna(how='any')
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
4  115 2019-01-05  3830.42      12  train
# To drop only rows where a specific column has missing data
df1.dropna(subset=['SET'], inplace=True)
# when inplace=True, the result is stored back in the dataframe the operation is done on
df1
    ID       DATE    VALUE  VOLUME    SET
0  103 2019-01-01  1453.95      14  train
1  106 2019-01-02  4897.93      18  train
2  109        NaT      NaN      19   test
3  112 2019-01-04      NaN      19   test
4  115 2019-01-05  3830.42      12  train
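Instead of dropping rows, missing values can also be filled. A sketch for the NaNs still left in df1 (an assumed approach, not from the original notebook):

df1['VALUE'] = df1.VALUE.fillna(df1.VALUE.mean())   # fill numeric NA with the column mean
df1['DATE'] = df1.DATE.fillna(method='ffill')       # carry the previous date forward over the NaT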

Operations

df.mean()  # same as df.mean(axis=0); column-wise mean
ID        1.718435e+16
VALUE     3.450255e+03
VOLUME    1.400000e+01
dtype: float64
df.mean(1)  # same as df.mean(axis=1); row-wise mean of the numeric columns only
0     733.975
1    2457.965
2    1966.970
3    1212.435
4    1921.210
5    2100.210
dtype: float64

Apply

Any built-in or user-defined function can be applied to the columns of a DataFrame.

max_val_vol = df[['VALUE', 'VOLUME']].apply(max)
max_val_vol
min_val_vol = df[['VALUE', 'VOLUME']].apply(min)
min_val_vol
diff_max_min = max_val_vol-min_val_vol
diff_max_min
VALUE     4897.93
VOLUME      19.00
dtype: float64

VALUE     1453.95
VOLUME       2.00
dtype: float64

VALUE     3443.98
VOLUME      17.00
dtype: float64
The above operation can also be done using a lambda function in Python.

Syntax: lambda arguments: expression

df[['VALUE', 'VOLUME']].apply(lambda x: x.max() - x.min())
VALUE     3443.98
VOLUME      17.00
dtype: float64
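apply also works row-wise with axis=1. A sketch computing value per unit volume for each row:

df.apply(lambda row: row['VALUE'] / row['VOLUME'], axis=1)   # one number per row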

Merge / Join

np.random.seed(42)
sales = pd.DataFrame({'order_id': np.random.randint(100, 105, 7),
                      'item': ['apple', 'guava', 'grape', 'grape', 'apple', 'guava', 'papaya'],
                      'date': ['2013-01-03', '2013-01-01', '2013-01-02', '2013-01-01', '2013-01-03', '2013-01-03', '2013-01-02']})
sales
   order_id    item        date
0       103   apple  2013-01-03
1       104   guava  2013-01-01
2       102   grape  2013-01-02
3       104   grape  2013-01-01
4       104   apple  2013-01-03
5       101   guava  2013-01-03
6       102  papaya  2013-01-02
cost = pd.DataFrame({'item': ['apple', 'guava', 'grape'],
                     'cost/kg': [200, 90, 170]})
cost
    item  cost/kg
0  apple      200
1  guava       90
2  grape      170
Left Join
sales_df = sales.merge(cost, on='item', how='left')
sales_df
   order_id    item        date  cost/kg
0       103   apple  2013-01-03    200.0
1       104   guava  2013-01-01     90.0
2       102   grape  2013-01-02    170.0
3       104   grape  2013-01-01    170.0
4       104   apple  2013-01-03    200.0
5       101   guava  2013-01-03     90.0
6       102  papaya  2013-01-02      NaN
Right Join
  • sales.merge(cost, on='item', how='right')
Inner Join
  • sales.merge(cost, on='item', how='inner')
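For completeness, an outer join keeps unmatched rows from both sides. A minimal sketch (not run in the original):

sales.merge(cost, on='item', how='outer')   # papaya keeps NaN for cost/kg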
Row binding two dataframes
pd.concat([sales, cost], axis=0)
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
   cost/kg        date    item  order_id
0      NaN  2013-01-03   apple     103.0
1      NaN  2013-01-01   guava     104.0
2      NaN  2013-01-02   grape     102.0
3      NaN  2013-01-01   grape     104.0
4      NaN  2013-01-03   apple     104.0
5      NaN  2013-01-03   guava     101.0
6      NaN  2013-01-02  papaya     102.0
0    200.0         NaN   apple       NaN
1     90.0         NaN   guava       NaN
2    170.0         NaN   grape       NaN
Column binding two dataframes
  • concat with axis=1 aligns on the row index, so the three rows of cost pair with the first three rows of sales and the remaining rows get NaN.
pd.concat([sales, cost], axis=1)
   order_id    item        date   item  cost/kg
0       103   apple  2013-01-03  apple    200.0
1       104   guava  2013-01-01  guava     90.0
2       102   grape  2013-01-02  grape    170.0
3       104   grape  2013-01-01    NaN      NaN
4       104   apple  2013-01-03    NaN      NaN
5       101   guava  2013-01-03    NaN      NaN
6       102  papaya  2013-01-02    NaN      NaN

Group by

# replacing NA in the dataframe above with the column mean
sales_df = sales_df.fillna(sales_df.mean())
sales_df
   order_id    item        date     cost/kg
0       103   apple  2013-01-03  200.000000
1       104   guava  2013-01-01   90.000000
2       102   grape  2013-01-02  170.000000
3       104   grape  2013-01-01  170.000000
4       104   apple  2013-01-03  200.000000
5       101   guava  2013-01-03   90.000000
6       102  papaya  2013-01-02  153.333333
Calculating the order_id-wise mean cost per kg
mean_cost = sales_df.groupby('order_id')['cost/kg'].mean()
mean_cost
order_id
101     90.000000
102    161.666667
103    200.000000
104    153.333333
Name: cost/kg, dtype: float64
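groupby also supports several aggregations at once via agg. A quick sketch (output not shown):

sales_df.groupby('order_id')['cost/kg'].agg(['mean', 'min', 'count'])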

Adding a column using the map function

  • Objective: add a column 'mean_cost_per_kg' that contains the order_id-wise average cost per kg.
sales_df['mean_cost_per_kg'] = sales_df.order_id.map(sales_df.groupby('order_id')['cost/kg'].mean().to_dict())
sales_df
   order_id    item        date     cost/kg  mean_cost_per_kg
0       103   apple  2013-01-03  200.000000        200.000000
1       104   guava  2013-01-01   90.000000        153.333333
2       102   grape  2013-01-02  170.000000        161.666667
3       104   grape  2013-01-01  170.000000        153.333333
4       104   apple  2013-01-03  200.000000        153.333333
5       101   guava  2013-01-03   90.000000         90.000000
6       102  papaya  2013-01-02  153.333333        161.666667
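The same column can be produced more directly with groupby plus transform, which returns a result aligned to the original rows. A sketch of the equivalent one-liner:

sales_df['mean_cost_per_kg'] = sales_df.groupby('order_id')['cost/kg'].transform('mean')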
Getting data in/out
# writing data to an Excel or CSV file

#path = '/mapr/hdpbiu/_392256_vishnu/softwares_for_new_joinee'   # /mapr/hdpbiu = /Biu_load for server
path = 'C:/Users/392256/Documents/Intro to Python'

sales_df.to_csv(f'{path}/sales_df.csv', index=False)
sales_df.to_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1', index=False)
#sales_df.to_parquet(f'{path}/sales_df', engine='auto', index=False)
Reading from a CSV or Excel file
df = pd.read_csv(f'{path}/sales_df.csv', dtype={'order_id': str})
df.dtypes
order_id             object
item                 object
date                 object
cost/kg             float64
mean_cost_per_kg    float64
dtype: object
Reading other file formats
  • pd.read_excel for reading Excel files (see the sketch below)
  • pd.read_parquet for reading Parquet files
  • pd.read_clipboard for reading from the clipboard
  • pd.read_pickle for reading pickle files (often used to store ML models)
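For example, reading back the Excel file written earlier; a sketch, assuming the same path and sheet name as above:

df_xl = pd.read_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1')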

THE END