01 Oct, 2019
By: Vishnu Prakash Singh
from IPython.display import Image
from datetime import date
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression's output in a cell, not just the last
Pandas Series
- A pandas Series is a one-dimensional labeled array capable of holding data of any type.
- The axis labels are collectively called the index.
- Labels need not be unique but must be of a hashable type.
import pandas as pd
import numpy as np  # needed for the np.random calls below

data = pd.Series(np.random.randint(10, 20, 6), index=pd.date_range('20130101', periods=6))
print('Type of the series is ' + str(type(data)))
Type of the series is <class 'pandas.core.series.Series'>
print('Index of the series is ' + str(data.index))
Index of the series is DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
print('Values of the series are ' + str(data.values))
Values of the series are [12 14 18 12 13 10]
print('Data Type of the series is ' + str(data.dtype))
Data Type of the series is int32
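As noted above, index labels need not be unique; a minimal illustrative sketch (not part of the original notebook):

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])  # the label 'a' repeats, which is allowed
s['a']  # a duplicated label returns all matching rows as a Series
s['b']  # a unique label returns the scalar 3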
Pandas DataFrame
np.random.seed(47)
df = pd.DataFrame({'ID': list(map(str, np.arange(103, 120, 3))),  # map applies str to every element; list() materializes it
                   'DATE': pd.date_range('20190101', periods=6),
                   'VALUE': np.round(np.random.uniform(1000, 5000, 6), 2),
                   'VOLUME': np.random.randint(1, 20, 6),
                   'SET': np.random.choice(['test', 'train'], 6)})
df.shape
df
(6, 5)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Checking the data type of all columns:

df.dtypes
ID object
DATE datetime64[ns]
VALUE float64
VOLUME int32
SET object
dtype: object
info() returns various details of the dataframe:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
ID 6 non-null object
DATE 6 non-null datetime64[ns]
VALUE 6 non-null float64
VOLUME 6 non-null int32
SET 6 non-null object
dtypes: datetime64[ns](1), float64(1), int32(1), object(2)
memory usage: 344.0+ bytes
describe() shows a quick statistic summary of your data:
df.describe() # only takes numeric columns under consideration
|       | VALUE       | VOLUME    |
|-------|-------------|-----------|
| count | 6.000000    | 6.000000  |
| mean  | 3450.255000 | 14.000000 |
| std   | 1272.159191 | 6.542171  |
| min   | 1453.950000 | 2.000000  |
| 25%   | 2762.007500 | 12.500000 |
| 50%   | 3872.680000 | 16.000000 |
| 75%   | 4127.550000 | 18.750000 |
| max   | 4897.930000 | 19.000000 |
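By default describe() summarizes only the numeric columns; pandas' standard include='all' option extends the summary to the remaining columns as well (a sketch, output omitted):

df.describe(include='all')  # also summarizes object, category and datetime columns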
Viewing pandas data

df.head(2)  # first two rows
df.tail(2)  # last two rows

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Display the index, columns:
df.index.to_list() # converting index to list for better printing
df.columns
[0, 1, 2, 3, 4, 5]
Index(['ID', 'DATE', 'VALUE', 'VOLUME', 'SET'], dtype='object')
pandas to numpy
Note: DataFrame.to_numpy() does not include the index or column labels in the output.

df.to_numpy()

array([['103', Timestamp('2019-01-01 00:00:00'), 1453.95, 14, 'train'],
       ['106', Timestamp('2019-01-02 00:00:00'), 4897.93, 18, 'train'],
       ['109', Timestamp('2019-01-03 00:00:00'), 3914.94, 19, 'test'],
       ['112', Timestamp('2019-01-04 00:00:00'), 2405.87, 19, 'test'],
       ['115', Timestamp('2019-01-05 00:00:00'), 3830.42, 12, 'train'],
       ['118', Timestamp('2019-01-06 00:00:00'), 4198.42, 2, 'test']],
      dtype=object)
Converting the data type of a variable

df.dtypes
df['VOLUME'] = df.VOLUME.astype(str)  # converting from integer to string
df.VOLUME.dtype
df['ID'] = df.ID.astype('category')  # converting from string to category (pandas' equivalent of R's factor)
df.ID.dtype
df['VOLUME'] = df.VOLUME.astype(int)  # converting from string back to integer
df.VOLUME.dtype
ID object
DATE datetime64[ns]
VALUE float64
VOLUME int32
SET object
dtype: object
dtype('O')
CategoricalDtype(categories=['103', '106', '109', '112', '115', '118'], ordered=False)
dtype('int32')
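A related conversion not covered above: date strings can be parsed into datetime64 with pd.to_datetime (a minimal illustrative sketch):

pd.to_datetime(pd.Series(['2019-01-01', '2019-01-02'])).dtype  # datetime64[ns]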
Transposing your data:
df.T.shape  # rows and columns are interchanged
df.T
(5, 6)
|        | 0                   | 1                   | 2                   | 3                   | 4                   | 5                   |
|--------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| ID     | 103                 | 106                 | 109                 | 112                 | 115                 | 118                 |
| DATE   | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 | 2019-01-03 00:00:00 | 2019-01-04 00:00:00 | 2019-01-05 00:00:00 | 2019-01-06 00:00:00 |
| VALUE  | 1453.95             | 4897.93             | 3914.94             | 2405.87             | 3830.42             | 4198.42             |
| VOLUME | 14                  | 18                  | 19                  | 19                  | 12                  | 2                   |
| SET    | train               | train               | test                | test                | train               | test                |
Sorting by an axis:
- axis=0 means a row-wise operation
- axis=1 means a column-wise operation (see the sketch below)
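Before sorting, a small illustrative sketch of the axis convention itself (not in the original notebook):

df[['VALUE', 'VOLUME']].sum(axis=0)  # aggregates down the rows: one total per column
df[['VALUE', 'VOLUME']].sum(axis=1)  # aggregates across the columns: one total per row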
# sorting by the row index
df.sort_index(axis=0, ascending=False)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
# sorting by the column index
df.sort_index(axis=1, ascending=False)
|   | VOLUME | VALUE   | SET   | ID  | DATE       |
|---|--------|---------|-------|-----|------------|
| 0 | 14     | 1453.95 | train | 103 | 2019-01-01 |
| 1 | 18     | 4897.93 | train | 106 | 2019-01-02 |
| 2 | 19     | 3914.94 | test  | 109 | 2019-01-03 |
| 3 | 19     | 2405.87 | test  | 112 | 2019-01-04 |
| 4 | 12     | 3830.42 | train | 115 | 2019-01-05 |
| 5 | 2      | 4198.42 | test  | 118 | 2019-01-06 |
# sorting by a column
df.sort_values(by='VOLUME', ascending=False)
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 3 | 112 | 2019-01-04 | 2405.87 | 19     | test  |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
Selection of a subset of data

# Selecting a single column yields a Series; equivalent to df.ID
df['ID'] # returns pandas.core.series.Series
0 103
1 106
2 109
3 112
4 115
5 118
Name: ID, dtype: category
Categories (6, object): [103, 106, 109, 112, 115, 118]
# Selecting via [], which slices the rows
df[0:3]
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
Selection by label using loc
df.loc[:,['VALUE']] # returns pandas.core.frame.DataFrame
|   | VALUE   |
|---|---------|
| 0 | 1453.95 |
| 1 | 4897.93 |
| 2 | 3914.94 |
| 3 | 2405.87 |
| 4 | 3830.42 |
| 5 | 4198.42 |
df.loc[0:1]  # label-based row slice; both endpoints are included

|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
df.loc[1:2, ['ID', 'VALUE']]  # note: unlike positional slicing, the endpoint (label 2) is included
|   | ID  | VALUE   |
|---|-----|---------|
| 1 | 106 | 4897.93 |
| 2 | 109 | 3914.94 |
Selection by position using iloc

df.iloc[3]  # the fourth row, returned as a Series

ID                       112
DATE     2019-01-04 00:00:00
VALUE                2405.87
VOLUME                    19
SET                     test
Name: 3, dtype: object
df.iloc[[1, 2, 4], [0, 2]]
|   | ID  | VALUE   |
|---|-----|---------|
| 1 | 106 | 4897.93 |
| 2 | 109 | 3914.94 |
| 4 | 115 | 3830.42 |
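One difference between the two selection styles is worth spelling out (an illustrative sketch): loc slices by label and includes both endpoints, whereas iloc slices by position and excludes the stop index.

df.loc[1:2, ['ID', 'VALUE']]   # label slice: rows 1 AND 2
df.iloc[1:2, [0, 2]]           # position slice: row 1 only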
Boolean indexing

df[df.VALUE > 3000]  # e.g. keep rows where VALUE exceeds 3000
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | 2019-01-03 | 3914.94 | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
| 5 | 118 | 2019-01-06 | 4198.42 | 2      | test  |
df[df.SET.isin(['train'])]  # equivalent to df[df.SET == 'train']
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
Missing data
df1 = df.copy()  # creates a copy of the dataframe
df1.loc[2, 'DATE'] = np.nan    # replacing a value with NA
df1.loc[2:3, 'VALUE'] = np.nan
df1.loc[5, 'SET'] = np.nan
# checking counts of missing values
df1.VALUE.isna().sum()
df1.isna().sum()
2
ID 0
DATE 1
VALUE 2
VOLUME 0
SET 1
dtype: int64
# To drop any row that has missing data:
df1.dropna(how='any')
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
# To drop rows that have missing data in a specific column:
df1.dropna(subset=['SET'], inplace=True)
# with inplace=True, the result is stored back in the same dataframe
df1
|   | ID  | DATE       | VALUE   | VOLUME | SET   |
|---|-----|------------|---------|--------|-------|
| 0 | 103 | 2019-01-01 | 1453.95 | 14     | train |
| 1 | 106 | 2019-01-02 | 4897.93 | 18     | train |
| 2 | 109 | NaT        | NaN     | 19     | test  |
| 3 | 112 | 2019-01-04 | NaN     | 19     | test  |
| 4 | 115 | 2019-01-05 | 3830.42 | 12     | train |
Operations
df.mean()  # df.mean(axis=0); note the ID figure below is an artifact: the string IDs are summed by concatenation ('103106...118') and divided by 6
ID 1.718435e+16
VALUE 3.450255e+03
VOLUME 1.400000e+01
dtype: float64
df.mean(1)  # df.mean(axis=1); row-wise mean, computed over the numeric columns only
0 733.975
1 2457.965
2 1966.970
3 1212.435
4 1921.210
5 2100.210
dtype: float64
Apply
Any built-in or user-defined function can be applied to the columns of a DataFrame.
max_val_vol = df[['VALUE', 'VOLUME']].apply(max)
max_val_vol
min_val_vol = df[['VALUE', 'VOLUME']].apply(min)
min_val_vol
diff_max_min = max_val_vol-min_val_vol
diff_max_min
VALUE 4897.93
VOLUME 19.00
dtype: float64
VALUE 1453.95
VOLUME 2.00
dtype: float64
VALUE 3443.98
VOLUME 17.00
dtype: float64
The above operation can also be done using a lambda function in Python.
Syntax: lambda arguments: expression
df[['VALUE', 'VOLUME']].apply(lambda x: x.max() - x.min())
VALUE 3443.98
VOLUME 17.00
dtype: float64
Merge / Join
np.random.seed(42)
sales = pd.DataFrame({'order_id': np.random.randint(100, 105, 7),
                      'item': ['apple', 'guava', 'grape', 'grape', 'apple', 'guava', 'papaya'],
                      'date': ['2013-01-03', '2013-01-01', '2013-01-02', '2013-01-01', '2013-01-03', '2013-01-03', '2013-01-02']})
sales
|   | order_id | item   | date       |
|---|----------|--------|------------|
| 0 | 103      | apple  | 2013-01-03 |
| 1 | 104      | guava  | 2013-01-01 |
| 2 | 102      | grape  | 2013-01-02 |
| 3 | 104      | grape  | 2013-01-01 |
| 4 | 104      | apple  | 2013-01-03 |
| 5 | 101      | guava  | 2013-01-03 |
| 6 | 102      | papaya | 2013-01-02 |
cost = pd.DataFrame({'item': ['apple', 'guava', 'grape'],
                     'cost/kg': [200, 90, 170]})
cost
|   | item  | cost/kg |
|---|-------|---------|
| 0 | apple | 200     |
| 1 | guava | 90      |
| 2 | grape | 170     |
Left Join
sales_df = sales.merge(cost, on='item', how='left')
sales_df
|   | order_id | item   | date       | cost/kg |
|---|----------|--------|------------|---------|
| 0 | 103      | apple  | 2013-01-03 | 200.0   |
| 1 | 104      | guava  | 2013-01-01 | 90.0    |
| 2 | 102      | grape  | 2013-01-02 | 170.0   |
| 3 | 104      | grape  | 2013-01-01 | 170.0   |
| 4 | 104      | apple  | 2013-01-03 | 200.0   |
| 5 | 101      | guava  | 2013-01-03 | 90.0    |
| 6 | 102      | papaya | 2013-01-02 | NaN     |
Right Join
- sales.merge(cost, on='item', how='right')
Inner Join
- sales.merge(cost, on='item', how='inner')
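Since 'papaya' has no entry in cost, it is dropped by both of the joins above; a quick sketch of what each call returns (outputs omitted):

inner_df = sales.merge(cost, on='item', how='inner')  # 6 rows: only items present in both frames
right_df = sales.merge(cost, on='item', how='right')  # keeps every cost item; identical to the inner join for this data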
Row binding two dataframes

pd.concat([sales, cost], axis=0)
C:\Users\392256\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
  """Entry point for launching an IPython kernel.
|   | cost/kg | date       | item   | order_id |
|---|---------|------------|--------|----------|
| 0 | NaN     | 2013-01-03 | apple  | 103.0    |
| 1 | NaN     | 2013-01-01 | guava  | 104.0    |
| 2 | NaN     | 2013-01-02 | grape  | 102.0    |
| 3 | NaN     | 2013-01-01 | grape  | 104.0    |
| 4 | NaN     | 2013-01-03 | apple  | 104.0    |
| 5 | NaN     | 2013-01-03 | guava  | 101.0    |
| 6 | NaN     | 2013-01-02 | papaya | 102.0    |
| 0 | 200.0   | NaN        | apple  | NaN      |
| 1 | 90.0    | NaN        | guava  | NaN      |
| 2 | 170.0   | NaN        | grape  | NaN      |
Column binding two dataframes
pd.concat([sales, cost], axis=1)
|   | order_id | item   | date       | item  | cost/kg |
|---|----------|--------|------------|-------|---------|
| 0 | 103      | apple  | 2013-01-03 | apple | 200.0   |
| 1 | 104      | guava  | 2013-01-01 | guava | 90.0    |
| 2 | 102      | grape  | 2013-01-02 | grape | 170.0   |
| 3 | 104      | grape  | 2013-01-01 | NaN   | NaN     |
| 4 | 104      | apple  | 2013-01-03 | NaN   | NaN     |
| 5 | 101      | guava  | 2013-01-03 | NaN   | NaN     |
| 6 | 102      | papaya | 2013-01-02 | NaN   | NaN     |
Group by
# replacing the NAs in the above dataframe with the column mean
sales_df = sales_df.fillna(sales_df.mean())
sales_df
|   | order_id | item   | date       | cost/kg    |
|---|----------|--------|------------|------------|
| 0 | 103      | apple  | 2013-01-03 | 200.000000 |
| 1 | 104      | guava  | 2013-01-01 | 90.000000  |
| 2 | 102      | grape  | 2013-01-02 | 170.000000 |
| 3 | 104      | grape  | 2013-01-01 | 170.000000 |
| 4 | 104      | apple  | 2013-01-03 | 200.000000 |
| 5 | 101      | guava  | 2013-01-03 | 90.000000  |
| 6 | 102      | papaya | 2013-01-02 | 153.333333 |
Calculating the order-id-wise mean cost per kg:
mean_cost = sales_df.groupby('order_id')['cost/kg'].mean()
mean_cost
order_id
101 90.000000
102 161.666667
103 200.000000
104 153.333333
Name: cost/kg, dtype: float64
Adding a column using the map function
- Objective: add a column 'mean_cost_per_kg' containing the order-id-wise average cost per kg.
sales_df['mean_cost_per_kg'] = sales_df.order_id.map(sales_df.groupby('order_id')['cost/kg'].mean().to_dict())
sales_df
|   | order_id | item   | date       | cost/kg    | mean_cost_per_kg |
|---|----------|--------|------------|------------|------------------|
| 0 | 103      | apple  | 2013-01-03 | 200.000000 | 200.000000       |
| 1 | 104      | guava  | 2013-01-01 | 90.000000  | 153.333333       |
| 2 | 102      | grape  | 2013-01-02 | 170.000000 | 161.666667       |
| 3 | 104      | grape  | 2013-01-01 | 170.000000 | 153.333333       |
| 4 | 104      | apple  | 2013-01-03 | 200.000000 | 153.333333       |
| 5 | 101      | guava  | 2013-01-03 | 90.000000  | 90.000000        |
| 6 | 102      | papaya | 2013-01-02 | 153.333333 | 161.666667       |
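The same column can be produced without building an intermediate dictionary by using groupby().transform, which broadcasts the group-level mean back onto every row (a minimal equivalent sketch):

sales_df['mean_cost_per_kg'] = sales_df.groupby('order_id')['cost/kg'].transform('mean')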
Getting data in/out
# writing data to an excel or csv file
#path = '/mapr/hdpbiu/_392256_vishnu/softwares_for_new_joinee' # /mapr/hdpbiu = /Biu_load for server
path = 'C:/Users/392256/Documents/Intro to Python'
sales_df.to_csv(f'{path}/sales_df.csv', index=False)
sales_df.to_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1', index=False)
#sales_df.to_parquet(f'{path}/sales_df', engine = 'auto',index = False)
Reading from a csv or excel file
from datetime import datetime
df = pd.read_csv(f'{path}/sales_df.csv', dtype = {'order_id' : str })
df.dtypes
order_id object
item object
date object
cost/kg float64
mean_cost_per_kg float64
dtype: object
- pd.read_excel for reading excel files
- pd.read_parquet for reading parquet files
- pd.read_clipboard for reading from the clipboard
- pd.read_pickle for reading pickle files, e.g. those in which ML models are stored
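For example, the excel file written above could be read back with (a sketch assuming the same path):

df_xl = pd.read_excel(f'{path}/sales_df.xlsx', sheet_name='Sheet1')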
THE END