Python – 5-2. pandas(DataFrame) like SQL
import pandas as pd df_emp = pd.read_csv('emp.csv', encoding='utf-8') df_dept = pd.read_csv('dept.csv', encoding='utf-8')
集計(dept別でsalが最大のempno)
SQL
WITH grp AS ( SELECT deptno, MAX(sal) as maxsal FROM emp GROUP BY deptno ) SELECT e.deptno, empno, sal FROM emp e INNER JOIN grp g ON (e.deptno = g.deptno AND e.sal = g.maxsal) ORDER BY e.deptno, empno;
pandas(deptでグループ化して最大のsalの金額でjoin)
※同金額が2件ある場合は2件とも取得される
pd.merge(df_emp, df_emp.groupby('dept')[['sal']].agg(max).reset_index() \ ,how='inner', right_on='dept', left_on='dept') \ .query('sal_x == sal_y') \ .sort_values('dept') ==> empno ename job manager hiredate sal_x comm dept sal_y ==> 12 7839 藤原太郎 PRESIDENT NaN 1981/11/17 5000 NaN 10.0 5000 ==> 2 7788 中村健一 ANALYST 7566.0 1987/04/19 3000 NaN 20.0 3000 ==> 4 7902 松本亮太 ANALYST 7566.0 1981/12/03 3000 NaN 20.0 3000 ==> 8 7698 伊藤誠 MANAGER 7839.0 1981/05/01 2850 NaN 30.0 2850
pandas(deptでグループ化しsalのランク1位のみを取得)
※同金額が2件ある場合、そのうち1件しか取得されない
※deptでjoinしないのでdeptがnullの行も抽出される
pd.merge(df_emp.reset_index(), df_emp.groupby('dept')[['sal']].rank(ascending=False, method='first') \ .reset_index(), how='inner', right_on='index', left_on='index') \ .query('sal_y == 1.0') \ .sort_values('dept') ==> index empno ename job manager hiredate sal_x comm dept sal_y ==> 8 8 7839 藤原太郎 PRESIDENT NaN 1981/11/17 5000 NaN 10.0 1.0 ==> 7 7 7788 中村健一 ANALYST 7566.0 1987/04/19 3000 NaN 20.0 1.0 ==> 5 5 7698 伊藤誠 MANAGER 7839.0 1981/05/01 2850 NaN 30.0 1.0
pandas(先にdept,sal毎にソートし、dept毎の最上位1件だけを取得)
df_emp.sort_values(by=['dept','sal'], ascending=[True,False]).groupby('dept').head(1) ==> empno ename job manager hiredate sal comm dept ==> 8 7839 藤原太郎 PRESIDENT NaN 1981/11/17 5000 NaN 10.0 ==> 7 7788 中村健一 ANALYST 7566.0 1987/04/19 3000 NaN 20.0 ==> 5 7698 伊藤誠 MANAGER 7839.0 1981/05/01 2850 NaN 30.0 ==> 14 8001 小林政男 SALESMAN 7698.0 1985/03/17 1300 NaN NaN
集計(dept別でsalが最大の行名(行index))
SQL(rowidを取得)
WITH grp AS ( SELECT dept, MAX(sal) as maxsal FROM emp GROUP BY dept ) SELECT rowid FROM emp e INNER JOIN grp g ON (e.deptno = g.deptno AND e.sal = g.maxsal) ORDER BY e.deptno, empno;
pandas(dept別のindex)
df_emp.groupby('dept')['sal'].idxmax() ==> dept ==> 10.0 8 ==> 20.0 7 ==> 30.0 5 ==> Name: sal, dtype: int64 ※複数件ある場合に、複数のindexを取得したい場合は、deptでgroupbyし、そのmax()と一致するdept,salを持ったindexを取得する ↓ pd.merge( df_emp[['dept','sal']].reset_index(), df_emp.groupby('dept')['sal'].max().reset_index(), how='inner', on=['dept','sal']) \ .loc[:,['dept','index']].sort_values('dept').reset_index(drop=True) ==> dept index ==> 0 10.0 8 ==> 1 20.0 7 ==> 2 20.0 12 ==> 3 30.0 5
縦結合(UNION)
SQL
SELECT * FROM emp WHERE dept = 10 UNION SELECT * FROM emp WHERE dept = 20;
pandas
pd.concat([df_emp[df_emp['dept']==10], df_emp[df_emp['dept']==20]]) もしくは、 pd.concat([df_emp.query('dept==10'), df_emp.query('dept==20')]) ※concatの引数はiterableなオブジェクト(リストやタプル)で指定する ==> empno ename job manager hiredate sal comm dept ==> 6 7782 山本尚人 MANAGER 7839.0 1981/06/09 2450 NaN 10.0 ==> 8 7839 藤原太郎 PRESIDENT NaN 1981/11/17 5000 NaN 10.0 ==> 13 7934 木村大輝 CLERK 7782.0 1982/01/23 1300 NaN 10.0 ==> 0 7369 佐藤一郎 CLERK 7902.0 1980/12/17 800 NaN 20.0 ==> 3 7566 田中義男 MANAGER 7839.0 1981/04/02 2975 NaN 20.0 ==> 7 7788 中村健一 ANALYST 7566.0 1987/04/19 3000 NaN 20.0 ==> 10 7876 吉田翔太 CLERK 7788.0 1987/05/23 1100 NaN 20.0 ==> 12 7902 松本亮太 ANALYST 7566.0 1981/12/03 3000 NaN 20.0
厳密にUNIONの動作(重複行は削除)に合わせたいときは
pd.concat([df_emp.query('dept==10'), df_emp.query('dept==20')]).drop_duplicates()
とする(この例では重複は発生しないので変わりないが)
最近のコメント