Let's start by building a simple corpus:
library(tm)

text <- c("今天天氣很好1","今天天氣很好2")
d.corpus <- Corpus(VectorSource(text))
inspect(d.corpus)
# <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# 今天天氣很好1
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# 今天天氣很好2
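(Anyone familiar with the tm package will probably notice that the object structure looks different from before. We'll look at the structure later; first, here is what happens if you use the old-style code.)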
d.corpus <- tm_map(d.corpus, function(x) gsub("今天","昨天",x))
inspect(d.corpus)
# <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
# [[1]]
# [1] 昨天天氣很好1
# [[2]]
# [1] 昨天天氣很好2
tdm = as.matrix(TermDocumentMatrix(d.corpus))
# Error in UseMethod("meta", x) :
#   no applicable method for 'meta' applied to an object of class "character"
# In addition: Warning message:
# In mclapply(unname(content(x)), termFreq, control) :
#   all scheduled cores encountered errors in user code
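The cause is that version 0.6 of the tm package changed the way documents are wrapped.
Let's first look at what the new document structure looks like: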
> str(d.corpus)
List of 2
 $ 1:List of 2
  ..$ content: chr "今天天氣很好1"
  ..$ meta :List of 7
  .. ..$ author : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-09-19 16:12:28"
  .. ..$ description : chr(0)
  .. ..$ heading : chr(0)
  .. ..$ id : chr "1"
  .. ..$ language : chr "en"
  .. ..$ origin : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 $ 2:List of 2
  ..$ content: chr "今天天氣很好2"
  ..$ meta :List of 7
  .. ..$ author : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-09-19 16:12:28"
  .. ..$ description : chr(0)
  .. ..$ heading : chr(0)
  .. ..$ id : chr "2"
  .. ..$ language : chr "en"
  .. ..$ origin : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
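Simply put, documents are now packaged like books: each document is stored in its own TextDocument list, and each TextDocument in turn holds a content element and a meta element. content holds the document text, while meta holds descriptive information about the document, such as author, heading, and creation date.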
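The two elements described above can be read with tm's accessor functions. The following is a minimal sketch, assuming tm 0.6 or later is installed. It calls `VCorpus()` rather than `Corpus()` as an assumption of this sketch, because newer tm releases may return a different corpus class from `Corpus()`. The variable names are illustrative only.

```r
library(tm)

text <- c("今天天氣很好1", "今天天氣很好2")
corpus <- VCorpus(VectorSource(text))

# Each element of the corpus is one document
doc <- corpus[[1]]

doc_text <- content(doc)     # the document text
doc_id   <- meta(doc, "id")  # a single metadata field
doc_meta <- meta(doc)        # all metadata fields (author, heading, ...)

print(class(doc))
print(doc_text)
print(doc_id)
```

Reading the text through `content()` rather than indexing into the list directly keeps the code independent of how the document is stored internally.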
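The old tm_map could access each document's content automatically. Now, applying tm_map directly (as in the example above) forcibly turns each document into a plain chr, and its document attributes are lost: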
> str(d.corpus)
List of 2
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
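One way to catch this before calling TermDocumentMatrix is to check that every element of the corpus is still a TextDocument. This is only a sketch: `all_documents_intact` is a hypothetical helper, not part of tm, and it assumes that `content()` on a VCorpus returns the list of its documents.

```r
library(tm)

# Hypothetical helper: TRUE when every element is still a TextDocument
all_documents_intact <- function(corpus) {
  docs <- content(corpus)
  all(vapply(docs, inherits, logical(1), what = "TextDocument"))
}

text <- c("今天天氣很好1", "今天天氣很好2")
corpus <- VCorpus(VectorSource(text))
intact_before <- all_documents_intact(corpus)

# Apply a plain function, as in the failing example above
mapped <- tm_map(corpus, function(x) gsub("今天", "昨天", x))
intact_after <- all_documents_intact(mapped)

print(intact_before)
print(intact_after)
```

The second check should report a problem whenever the documents have been reduced to bare character vectors, though the exact behaviour may depend on the installed tm version.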
The fix is to wrap the function in content_transformer(), so that tm_map changes only the content and each document keeps its class and metadata:
> d.corpus <- Corpus(VectorSource(text))
> d.corpus <- tm_map(d.corpus, content_transformer(function(x) gsub("今天","昨天",x)))
> inspect(d.corpus)
<<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
昨天天氣很好1
[[2]]
<<PlainTextDocument (metadata: 7)>>
昨天天氣很好2
> str(d.corpus)
List of 2
 $ 1:List of 2
  ..$ content: chr "昨天天氣很好1"
  ..$ meta :List of 7
  .. ..$ author : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-09-19 16:35:11"
  .. ..$ description : chr(0)
  .. ..$ heading : chr(0)
  .. ..$ id : chr "1"
  .. ..$ language : chr "en"
  .. ..$ origin : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 $ 2:List of 2
  ..$ content: chr "昨天天氣很好2"
  ..$ meta :List of 7
  .. ..$ author : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-09-19 16:35:11"
  .. ..$ description : chr(0)
  .. ..$ heading : chr(0)
  .. ..$ id : chr "2"
  .. ..$ language : chr "en"
  .. ..$ origin : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
> tdm = as.matrix(TermDocumentMatrix(d.corpus))
> tdm
               Docs
Terms           1 2
  昨天天氣很好1 1 0
  昨天天氣很好2 0 1
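To make it clearer what content_transformer() contributes, roughly the same result can be obtained with a hand-written function that edits only the content and returns the whole document. This is a minimal sketch, assuming tm 0.6 or later. `replace_today` is a hypothetical name, `VCorpus()` is used as an assumption of this sketch, and the code is meant as an illustration rather than as tm's actual implementation.

```r
library(tm)

text <- c("今天天氣很好1", "今天天氣很好2")
corpus <- VCorpus(VectorSource(text))

# Hypothetical transformation: change the text, keep the document wrapper
replace_today <- function(doc) {
  content(doc) <- gsub("今天", "昨天", content(doc))
  doc
}

corpus <- tm_map(corpus, replace_today)

# Each element is still a PlainTextDocument, so its metadata can be read
tdm <- as.matrix(TermDocumentMatrix(corpus))
print(tdm)
```

In practice content_transformer() is the shorter choice. Writing the function by hand is mainly useful when a transformation also needs to read or update the metadata.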
I was stuck on this for half a day until I came across this blog post. Thank you so much!!!
XD So this problem still exists.