100 Language Processing Knocks, Chapter 2: UNIX Command Basics

October 28, 2015 | ブログラミング

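Here I solve the 100 Language Processing Knocks (2015 edition), published by the Inui-Okazaki Laboratory at Tohoku University (http://www.cl.ecei.tohoku.ac.jp/nlp100/), in R.

A page with the same aim, https://rpubs.com/yamano357/85313, uses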

library(dplyr)
library(stringr)
library(stringi)

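and the like, but to me they seem to make things more cumbersome, not less (the author presumably finds the result clean).

So here I write everything with base functions only, without any special packages (personally I think nested function calls look clean).

Note that each problem can be written in many ways depending on the contents of the file, so take these as just one example each (they are not necessarily the optimal solution).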

10. Counting lines

Count the number of lines. Use the wc command to check.

command line:
wc -l hightemp.txt

R:
length(readLines("hightemp.txt"))
nrow(read.table("hightemp.txt", header=FALSE))

AWK one liner:
gawk 'END{print NR}' hightemp.txt

11. Replacing tabs with spaces

Replace each tab character with a single space. Use the sed, tr, or expand command to check.

command line:
tr '\t' ' ' < hightemp.txt

R:
gsub("\t", " ", readLines("hightemp.txt"))

AWK one liner:
gawk '{gsub("\t", " ", $0);print}' hightemp.txt

12. Saving column 1 to col1.txt and column 2 to col2.txt

Extract only the first column of each line and save it as col1.txt, and extract only the second column and save it as col2.txt. Use the cut command to check.

command line:
cut -f 1 hightemp.txt > col1.txt
cut -f 2 hightemp.txt > col2.txt

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
write(d[,1], "col1.txt")
write(d[,2], "col2.txt")

AWK one liner:
gawk '{print $1 > "col1.txt"; print $2 > "col2.txt"}' hightemp.txt

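13. Merging col1.txt and col2.txt

Join the col1.txt and col2.txt created in problem 12 to make a text file in which the first and second columns of the original file sit side by side, separated by a tab. Use the paste command to check.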

command line:
paste col1.txt col2.txt > merge.txt

R:
write(paste(readLines("col1.txt"), readLines("col2.txt"), sep="\t"), "merge.txt")

AWK one liner:
gawk '{getline a < "col2.txt"; print $0 "\t" a}' col1.txt > merge.txt

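14. Printing the first N lines

Receive a natural number N by some means such as a command-line argument, and display only the first N lines of the input. Use the head command to check.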

command line:
head -n 5 hightemp.txt

R:
readLines("hightemp.txt", 5)

AWK one liner:
gawk -v N=5 'FNR <= N' hightemp.txt
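
The commands above hard-code N. As a minimal sketch of receiving N as an argument instead, the awk version can be wrapped in a shell function. The function name first_n and its argument order are assumptions made for illustration, not part of the exercise.

```shell
# first_n N FILE: print the first N lines of FILE.
# (first_n is a hypothetical helper name chosen for this sketch.)
first_n() {
    # -v passes the shell argument into awk as the variable N;
    # FNR is the line number within the current input file.
    awk -v N="$1" 'FNR <= N' "$2"
}

# Usage: first_n 5 hightemp.txt
```

The output should be the same lines that head -n N prints for the same file.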

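15. Printing the last N lines

Receive a natural number N by some means such as a command-line argument, and display only the last N lines of the input. Use the tail command to check.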

command line:
tail -n 5 hightemp.txt

R:
tail(readLines("hightemp.txt"), 5)

AWK one liner:
gawk -v N=5 '{a[NR]=$0} END {for (i = NR-N+1; i <= NR; i++) print a[i]}' hightemp.txt
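
The AWK version keeps every line in memory. As a sketch of one alternative, a ring buffer of N slots holds only the most recent N lines. The function name last_n is an assumption made for illustration, and N is assumed to be at least 1, since N = 0 would divide by zero in the modulo.

```shell
# last_n N FILE: print the last N lines of FILE, assuming N >= 1.
# (last_n is a hypothetical helper name chosen for this sketch.)
last_n() {
    awk -v N="$1" '
        { buf[NR % N] = $0 }    # each new line overwrites the slot used N lines earlier
        END {
            # If the file has fewer than N lines, start from line 1.
            start = (NR > N) ? NR - N + 1 : 1
            for (i = start; i <= NR; i++) print buf[i % N]
        }' "$2"
}

# Usage: last_n 5 hightemp.txt
```

For a file as small as hightemp.txt the difference does not matter; the ring buffer only pays off on large inputs.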

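16. Splitting the file into N parts

Receive a natural number N by some means such as a command-line argument, and split the input file into files of N lines each. Achieve the same processing with the split command.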

command line: the files are named part-a, part-b, ... in order
split -a 1 -l 12 hightemp.txt part-

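R: the files are named part-1, part-2, ... in order
d = readLines("hightemp.txt")
N = 12                                  # lines per output file
no = 0                                  # number of the current output file
for (i in seq_along(d)) {
    if ((i-1) %% N == 0) {              # first line of a new chunk
        no = no+1
        fn = sprintf("part-%i", no)
        APPEND = FALSE                  # start the new file from scratch
    }
    write(d[i], file=fn, sep="", append=APPEND)
    APPEND = TRUE                       # append the remaining lines of the chunk
}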

AWK one liner: the files are named part-1, part-2, ... in order
awk -v N=12 '{m=(NR-1)/N; if (m == int(m)) fn="part-" (++no); print $0 > fn}' hightemp.txt
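
The solutions above treat N as the number of lines per file. If N is instead read as the number of parts, as the problem title suggests, one possible sketch is to derive the lines per part from the total line count and hand that to split. The function name split_into_parts and its argument order are assumptions made for illustration, and the input file is assumed to be non-empty.

```shell
# split_into_parts N FILE PREFIX: split FILE into at most N files,
# named PREFIXa, PREFIXb, ... (hypothetical helper for this sketch).
split_into_parts() {
    n=$1; file=$2; prefix=$3
    total=$(wc -l < "$file")            # total number of lines in the input
    per=$(( (total + n - 1) / n ))      # lines per part: ceiling of total / n
    split -a 1 -l "$per" "$file" "$prefix"
}

# Usage: split_into_parts 2 hightemp.txt part-
```

Because the lines per part are rounded up, this can produce fewer than N files when the line count is small relative to N (5 lines with N = 4 gives three files of 2, 2, and 1 lines).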

17. Distinct strings in column 1

Find the kinds of strings in the first column (the set of distinct strings). Use the sort and uniq commands to check.

command line:
cut -f 1 hightemp.txt | sort | uniq
cut -f 1 hightemp.txt | sort | uniq | wc -l

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
sort(unique(d[,1]))
length(sort(unique(d[,1])))

AWK one liner:
gawk '{a[$1]} END {for (i in a) print i}' hightemp.txt
gawk '{a[$1]} END {for (i in a) sum++; print sum}' hightemp.txt

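18. Sorting the lines in descending order of the numeric value in column 3

Sort the lines in reverse order of the numeric value in the third column (note: rearrange the lines without changing their contents). Use the sort command to check (for this problem, the result does not have to match the output of the command).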

command line:
sort -r -n -k 3 hightemp.txt

R:
d = readLines("hightemp.txt")
d[order(as.numeric(sapply(strsplit(d, "\t"), "[", 3)), decreasing=TRUE)]

AWK one liner:
Not well suited.
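
AWK has no sort statement, which is why it fits this problem poorly. For illustration only, here is a sketch that sorts inside awk with a hand-written insertion sort. The function name sort_by_col3_desc is an assumption, the input is assumed to be tab-separated, and the O(n^2) sort is only reasonable for small files.

```shell
# sort_by_col3_desc FILE: print the lines of FILE in descending numeric
# order of column 3 (hypothetical helper; assumes tab-separated input).
sort_by_col3_desc() {
    awk -F '\t' '
        { line[NR] = $0; key[NR] = $3 + 0 }    # keep each line and its numeric key
        END {
            # Insertion sort on key[], descending; line[] moves along with it.
            for (i = 2; i <= NR; i++) {
                k = key[i]; l = line[i]
                for (j = i - 1; j >= 1 && key[j] < k; j--) {
                    key[j + 1] = key[j]; line[j + 1] = line[j]
                }
                key[j + 1] = k; line[j + 1] = l
            }
            for (i = 1; i <= NR; i++) print line[i]
        }' "$1"
}

# Usage: sort_by_col3_desc hightemp.txt
```

Adding 0 to the third field forces a numeric comparison, so 10.2 sorts above 9.5. Lines with equal keys keep their original order, which may differ from how sort breaks ties.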

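19. Finding the frequency of each string in column 1 and listing them from most to least frequent

Find the frequency of occurrence of each string in the first column and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to check.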

command line:
cut -f 1 hightemp.txt | sort | uniq -c | sort -rn

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
sort(table(d[,1]), decreasing=TRUE)

AWK one liner:
gawk '{a[$1]++} END{for (i in a) print i, a[i] | "sort -k 2,2nr"}' hightemp.txt

裏 RjpWiki

Computer programs, computer science, and statistics in Julia, and sometimes R and Python
