只要有用到資料庫,那一定會學到SQL(Structured Query Language-結構化查詢語言)的語法,SPSS和SAS也有類似的語法或套件,SAS EG更是大量使用SQL的語法。雖然版本很多,但是SQL的語法大同小異,在剛使用R的時候,由於對其他的指令還不熟悉,馬上就搜尋有無SQL的套件可以使用--有的,叫做SQLdf(http://cran.r-project.org/web/packages/sqldf/index.html)
SQLdf套件採用的是SQLite(http://www.sqlite.org/index.html),支援標準的SQL語法,也有包含常用的aggregate指令,對於熟悉SQL語法結構的人來說相當的方便。
SQL最基本的語句是SELECT:程式流程如下(來源:http://www.codedata.com.tw/database/mysql-tutorial-basic-query/)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
library("sqldf") | |
x <- sqldf(" | |
SELECT gender, age, sum(icome) | |
FROM client | |
GROUP BY gender, age | |
ORDER BY gender, age | |
") |
sqldf除了簡單的查詢與聚合指令外,也支援複雜的指令(包括子查詢與條件句):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
good_ing <- sqldf(" | |
SELECT a.*,b.max, | |
CASE #這一段是說用max來recode變項 | |
WHEN b.max >= 80 THEN '80up' # | |
WHEN b.max >= 60 and b.max < 80 THEN '60-80' # | |
WHEN b.max >= 40 and b.max < 60 THEN '40-60' # | |
WHEN b.max < 40 THEN '40under' # | |
END AS ingLevel #<--將recode的變項重新命名 | |
FROM goodInfo as a, | |
(SELECT id, max(per) as max | |
FROM goodInfo | |
GROUP BY id | |
ORDER BY id) as b #<--這邊是子查詢 | |
WHERE a.id = b.id #<--這邊是說我把兩個表格合併的條件變項 | |
GROUP BY id | |
ORDER BY id | |
") |
由於sqldf用雙引號夾住sql指令,因此sql指令中間沒辦法在接受以"#"為開頭的註解,這點使用上必須注意。
另外需要注意的是,有時用我們在r裡面的變項名稱會用a.column1的方式來命名,但是這樣的變項名稱若在sqldf指令中使用時,會與原本sql內建的" . "混淆(sql中"a.b"的意思代表來自a表格中的b欄),在欄位命名上需要注意這點。
圖片來源:http://digijustin.com/2007/03/23/selecting-random-row-from-a-sql-select-statement/
沒有留言:
張貼留言