2021年5月のブログ記事一覧-裏 RjpWiki

ggplot2 にも優る vega-lite

2021年05月31日 | ブログラミング

R の ggplot2 にも負けず劣らず（きっと勝ってるだろうけど）の VegaLite がんばれ。

ただ，機能豊富な故，何をどう指定すれば望むように表示できるかの情報が取りにくい。

今回は，散布図において，x, y 軸の範囲の指定についてわかったことを記す。

肝は，scale ={}。デフォルトでは原点を含む図になるので，場合によっては間抜けなグラフになる。実際のデータの範囲について上手い具合に描いてもらうには scale={zero=false} を指定する。あるいは，具体的に幾つから幾つまでと指定したいときは scale={domain=[最小値, 最大値]}} と指定する。

なお，グループ別に色や形を変えるには color= とか shape= で使用する変数を指定する。

using VegaLite, CSV, DataFrames
path = "/Users/foo/Desktop/Julia/npb.csv"
df = CSV.read(path, DataFrame)
df |>
@vlplot(
:point,
x={:身長, scale={zero=false}},
y={:体重, scale={domain=[60, 125]}},
color=:守備位置,
shape=:守備位置,
width=400,
height=400
)

頑張ってます。クリックお願いします。

東京オリンピック開催中止！！

2021年05月29日 | 雑感

私は，東京オリンピック開催中止！を訴えます

頑張ってます。クリックお願いします。

東京オリンピック開催中止！！　菅総理大臣総理

2021年05月29日 | 雑感

さっさと　決断しなされ

もう，命運は尽きた

頑張ってます。クリックお願いします。

Julia でデータ解析--002　プロ野球選手の体格，体重は身長の何乗に比例するのか？

2021年05月29日 | ブログラミング

include("nonlinearfitting.jl"); # Julia に翻訳--215　非線形回帰に掲示

using CSV, DataFrames, FreqTables, StatsPlots

df = CSV.read("npb.csv", DataFrame);
function age(y, m, d)
res = 2021 - y
if m > 4 || (m == 4 && d > 1)
res -= 1
end
res
end

using Query

df2 = df |> @mutate(年齢 = age(_.生年, _.月, _.日),
BMI = _.体重 / (_.身長 / 100)^2) |>
@filter(_.区分 == "支配下選手") |> DataFrame;

using Plots

gdf = groupby(df, :守備位置);

3次元の直方体の物体の重さは，LxWxH に比重を掛ければ求まる。
取り上げている野球選手のデータではH（身長）がわかっているだけで，L,W はない。

体重 = a x 身長 ^ b としたらどうなるだろうか？

pyplot()
function powerfunction(df; group=df[1, 3])
x = df[!, :身長];
y = df[!, :体重];
obj = nonlinearfitting(x, y, model="Power1");
a, b = obj[:p]
plotfittedresult(obj)
scatter!(x, y, smooth=true, color=:black,
xlims=(166, 202), ylims=(64, 121))
title!("$group 体重 = $(round(a, digits=5)) * 身長 ^ $(round(b, digits=3))",
thickness_scaling = 0.7)
end

plt0 = powerfunction(df2, group="全選手")
savefig("fig1.png")

黒は回帰直線。赤が「体重 = a x 身長 ^ b」。b = 2.021 で二次曲線に極めて近いが，データの範囲内では回帰直線とほとんど差がない。

守備位置ごとに見てみると b の値はかなり異なるが，これまた，データの範囲内では回帰直線とほとんど差がない。

plt1 = powerfunction(gdf[1]);
plt2 = powerfunction(gdf[2]);
plt3 = powerfunction(gdf[3]);
plt4 = powerfunction(gdf[4]);
plot(plt1, plt2, plt3, plt4)
savefig("fig2.png")

b が一番大きいのは内野手で 2.507，一番小さいのは捕手で 1.542。この差は何なのだろうかな？
L, W, H の関係の違いかな。Julia でデータ解析--001 で書いたように，捕手は小柄で，身長と体重の相関係数も一番低い，ということが関係しているのかな。

頑張ってます。クリックお願いします。

暗号文を解読せよ

2021年05月29日 | ブログラミング

ヒント：単なるシーザー暗号ではない 2021/06/01 追記tjbfnjcmloikgzcsifsvygvncjbfahhclfnifjbucatjyhcvboownjciawilbohbzdbvawjuqjuqzpjzkwiswwlfofoikrzkwxhhzkhjavzwfjwcnphdvboooohzgtsihfzjfzhhzkslbogucrdsvyszuuvnsypbvnfzhhxpjdskvyhzzhduurosoosmavvabvawjucmhbtuoopcizcxvbxlwqlrvurnvrzkwxhhzkqvuzjuuzurpysrlomlazacihumlooiooazzmwzsrjmhchhrhfrlvvcsxvazacylrdjoolokvfopcivtoooomwzsrvzoapbvsfzzhduuksoxltjyhcvgzdvjosmluvcsoosdyzdcsnavvahchhihhdvbhpucazdcsdawnhzovuzavzytdahduuvurkycklfoooodsnocpsryvhcpgwbhduoghfblfnlbnlkzjoiucoksypqvasrlqvubjaqjugzjfvasrlqvubjavvszjdhcpgbycpurooswyoqlazuzdcwinoikrzhrrocnafpnuglrclfzooqlqjugzjfvasyphahfvicqlcpydjvfkvkzyhjhryvfylhmhqoavzdcmsrrpzgswoazzucolbjyzjuumlaztpzykchhrlgvfvzyswbhdaqvubzcsmmcmnsodvvahclmyprclfzphdztjyinavzswqpbbyooosmacwlrzkwxhhzkvzysovhcliimwipgclrrvffdvdjvoostdvjmcpnvoosmlvvcsooinmomzcivpgfoycoijsyphdzfvavzytjyinacwlvzysylrdjoolrovhclumlooaonrfztoduwinpzmcmlinavvatmvaoosnlvjucmlryloydsohyzpbxysvzsyksqvhdvbovhchhxhinltjykcpqcavzfuvcsoosghgomigsazhgpysjmrzccopciavvakzosmlvdnvgffzzcgcsooooavzzsyloyzvvszivhchjzkwzkwicoduhchhoownuoopcibbylfbvrnoogsvvcsvusriwmavjmtmlsyvavurooooncqlfitsiacaavzwsjwzzimoosklckssavfoosklckssnoogsbjadzywnotmvaooszhfoo

回答は，この記事のコメントとして投稿してください。

頑張ってます。クリックお願いします。

Julia でマークダウンファイルから html ファイルを生成する

2021年05月27日 | ブログラミング

頑張ってます。クリックお願いします。

Julia でデータ解析--001　プロ野球選手の体格，捕手といえばドカベン？

2021年05月26日 | ブログラミング

using CSV, DataFrames, FreqTables, StatsPlots

2021 年度のプロ野球選手のポジション，生年月日，身長，体重のデータ
https://npb.jp/bis/teams/

df = CSV.read("npb.csv", DataFrame);

年齢計算関数（2021/04/01 現在）
function age(y, m, d)
res = 2021 - y
if m > 4 || (m == 4 && d > 1)
res -= 1
end
res
end

年齢の計算と，"支配下選手" をフィルタリング
using Query
df2 = df |> @mutate(年齢 = age(_.生年, _.月, _.日),
BMI = _.体重 / (_.身長 / 100)^2) |>
@filter(_.区分 == "支配下選手") |> DataFrame

年齢の分布
res = freqtable(df2[!, :年齢]);

using Plots
pyplot(grid=false, label="")
bar(names(res), res, xlabel="年齢", ylabel="人", tick_direction=:out)

BMI の分布
histogram(df2[!, :BMI], xlabel="BMI", ylabel="人", tick_direction=:out)

守備範囲でグループ化
gdf = groupby(df2, :守備位置);
using Statistics

身長，体重，BMI の分布
combine(gdf, :身長 => mean, :身長 => std,
:体重 => mean, :体重 => std,
:BMI => mean, :BMI => std,
[:身長, :体重] => cor)

# 4 rows × 8 columns
# 守備位置身長_mean 身長_std 体重_mean 体重_std BMI_mean BMI_std 身長_体重_cor
# String Float64 Float64 Float64 Float64 Float64 Float64 Float64
# 1 外野手 180.55 5.50386 85.0643 8.93076 26.0627 2.10372 0.640489
# 2 投手 181.998 5.91158 86.0314 8.15001 25.9529 1.86002 0.649159
# 3 内野手 179.062 5.85431 83.7416 10.829 26.0598 2.58367 0.652952
# 4 捕手 177.62 4.09284 84.6962 6.15474 26.8427 1.69953 0.51652

身長と体重の散布図の描画
function correlation(df; group=df[1, 3])
plt = scatter(df[!, :身長], df[!, :体重],
markercolor=:blue, markerstrokecolor=:blue,
markeralpha=0.3, markerstrokealpha=0.3,
markersize=3, smooth=true,
xlabel="身長", xlims=(166, 202),
ylabel="体重", ylims=(64, 121),
tick_direction=:out)
title!("$group r = $(string(round(cor(df[!, :身長], df[!, :体重]), digits=3)))")
plt
end

extrema(df2[!, :身長]) # (167, 201)
extrema(df2[!, :体重]) # (65, 120)
extrema(df2[!, :BMI]) # (20.37037037037037, 36.15702479338843)

plt0 = correlation(df2, group="全選手")
plt1 = correlation(gdf[1]);
plt2 = correlation(gdf[2]);
plt3 = correlation(gdf[3]);
plt4 = correlation(gdf[4]);
plot(plt1, plt2, plt3, plt4)
savefig("fig1.png")

BMI のヒストグラム描画
function distributionofbmi(df; group=df[1, 3])
bmi = floor.(Int, df[!, :BMI] ./ 0.5) .* 0.5
res2 = freqtable(bmi)
plt = histogram(df[!, :BMI], bins=15, alpha=0.8,
xlims=(20, 37), xlabel="BMI",
ylims=(0, 0.35), ylabel="PDF", normalize=:pdf,
tick_direction=:out)
title!("$group Mean = $(string(round(mean(df[!, :BMI]), digits=3)))")
plt
end

plt0 = distributionofbmi(df2, group="全選手")
plt1 = distributionofbmi(gdf[1]);
plt2 = distributionofbmi(gdf[2]);
plt3 = distributionofbmi(gdf[3]);
plt4 = distributionofbmi(gdf[4]);
plot(plt1, plt2, plt3, plt4)
savefig("fig2.png")

いくつかわかったこと

捕手は比較的小柄な選手が多い。身長は 190 センチ未満，体重は 100 キログラム未満。いわゆる漫画のドカベンタイプ（大男）のイメージではない。しかし，BMI の平均値は 26.843 と最も高いので，小さいがずんぐりしているのかもしれない。
身長と体重の相関係数は，捕手が 0.517 と一番低い。内野手，投手，外野手では 0.65 程度である。

頑張ってます。クリックお願いします。

Julia に翻訳--239　PISA 2018データを読む，SPSS の .sav ファイル

2021年05月26日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

PISA 2018データを読む
https://oku.edu.mie-u.ac.jp/~okumura/python/pisa2018.html

ファイル名: .jl　　関数名:

翻訳するときに書いたメモ

途中で NamedArray のまま処理を進めるようにしたが，データフレームにして transform などを使ってデータ処理をしたほうがわかりやすいかもしれないが，まあいいか。

==========#

#==
>>> import pyreadstat
>>> from time import time
>>> s = time()
df, meta = pyreadstat.read_sav('STU/CY07_MSU_STU_QQQ.sav')
print(time()-s)
# 92.55421423912048 sec.
==#

using StatFiles, DataFrames, CSV
# @time df = DataFrame(load("STU/CY07_MSU_STU_QQQ.sav"))

とてつもなく遅い（4時間ぐらい！！！かかった）
R か Python で読んで，CSV ファイルの保存してから Julia でやるほうがよい。

@time CSV.write("STU/CY07_MSU_STU_QQQ.csv", df) # 255.005749 seconds
@time df = CSV.read("STU/CY07_MSU_STU_QQQ.csv", DataFrame) # 67.728522 seconds。 2回目以降は早くなる 46.337307 seconds
Base.summarysize(df)　# 5,464,156,768 byte
size(df) # (612004, 1119)
first(df, 2)
# 2 rows × 1,119 columns (omitted printing of 1110 columns)
# CNTRYID CNT CNTSCHID CNTSTUID CYC NatCen STRATUM SUBNATIO OECD
# Float64 String Float64 Float64 String Int64 String Int64 Float64
# 1 8.0 ALB 800002.0 800251.0 07MS 800 ALB0109 80000 0.0
# 2 8.0 ALB 800002.0 800402.0 07MS 800 ALB0109 80000 0.0
names(df)
# 1119-element Vector{String}:
# "CNTRYID"
# "CNT"
# "CNTSCHID"
# "CNTSTUID"
# "CYC"
# ⋮
# "SENWT"
# "VER_DAT"
# "test"

#> "IC001Q01TA" の回答の度数分布を表示する：

using FreqTables
freqtable(df[!, :IC001Q01TA])
# 4-element Named Vector{Int64}
# Dim1 │
# ────────┼───────
# 1.0 │ 207182
# 2.0 │ 50529
# 3.0 │ 97676
# missing │ 256617

#> 国名と回答の度数分布

freqtable(df[!, :CNT], df[!, :IC001Q01TA])
# 80×4 Named Matrix{Int64}
# Dim1 ╲ Dim2 │ 1.0 2.0 3.0 missing
# ────────────┼───────────────────────────────────
# ALB │ 3940 558 1589 272
# ARE │ 0 0 0 19277
# ARG │ 0 0 0 11975
# AUS │ 6858 2406 2738 2271
# AUT │ 4797 626 972 407
# USA │ 2577 843 1183 235
# VNM │ 0 0 0 5377

#> 日本に限れば

gdf = groupby(df, :CNT)
jpn = gdf[(CNT="JPN",)]
freqtable(jpn[!, :IC001Q01TA])
# 4-element Named Vector{Int64}
# Dim1 │
# ────────┼─────
# 1.0 │ 2200
# 2.0 │ 1790
# 3.0 │ 2029
# missing │ 90

#> 国ごとに回答1の割合を求めるには

# クロス集計
tbl = freqtable(df[!, :CNT], df[!, :IC001Q01TA])
# 80×4 Named Matrix{Int64}
# Dim1 ╲ Dim2 │ 1.0 2.0 3.0 missing
# ────────────┼───────────────────────────────────
# ALB │ 3940 558 1589 272
# ARE │ 0 0 0 19277
# ARG │ 0 0 0 11975
# ⋮ ⋮ ⋮ ⋮ ⋮
# USA │ 2577 843 1183 235
# VNM │ 0 0 0 5377

# 全てが missing の国を除く
tbl2 = tbl[((tbl[:, 1] .!= 0) .& (tbl[:, 2] .!= 0) .& (tbl[:, 3] .!= 0)), :]
# 53×4 Named Matrix{Int64}
# Dim1 ╲ Dim2 │ 1.0 2.0 3.0 missing
# ────────────┼───────────────────────────────────
# ALB │ 3940 558 1589 272
# AUS │ 6858 2406 2738 2271
# AUT │ 4797 626 972 407
# ⋮ ⋮ ⋮ ⋮ ⋮
# URY │ 2197 556 1351 1159
# USA │ 2577 843 1183 235

# missing を除いたサンプルサイズ
rowsum = sum(tbl2[:, 1:3], dims=2)
# 53×1 Named Matrix{Int64}
# Dim1 ╲ Dim2 │ sum(Dim2)
# ────────────┼──────────
# ALB │ 6087
# AUS │ 12002
# AUT │ 6395
# ⋮ ⋮
# URY │ 4104
# USA │ 4603

# 回答1の割合
s = vec(tbl2[:, 1] ./ rowsum)
# 53-element Vector{Float64}:
# 0.647281090849351
# 0.5714047658723546
# 0.7501172791243159
# ⋮
# 0.5353313840155945
# 0.559852270258527

# 国名取り出し
country = names(tbl2)[1]

# 並べ替え順（R の order()）
o = sortperm(s)

# 3文字略称を日本語名に変換
countryname = CSV.read("countryname.csv", DataFrame);
name = fill("", length(s));
for (i, abb3) in enumerate(country[o])
j = indexin([abb3], countryname[!, :abb3])[1]
name[i] = isnothing(j) ? abb3 : countryname[j, :nameJ]
end

# ドットプロット
using Plots
pyplot()
plt = scatter(
s[o],
country[o],
tick_direction=:out,
size=(500, 700),
label="")
yticks!(0.5:length(name), name)
savefig("fig.png")

頑張ってます。クリックお願いします。

Julia に翻訳--238　SPSS ファイルを読む

2021年05月25日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

TALIS 2018
https://oku.edu.mie-u.ac.jp/~okumura/python/talis2018.html

ファイル名: .jl　　関数名:

翻訳するときに書いたメモ

==========#

# import Pkg; Pkg.add("StatFiles")
using StatFiles, DataFrames, CSV

StatFiles で，Stata, SPSS, SAS ファイルを DataFrame へ読み込むことができる

@time df = DataFrame(load("SPSS_2018_international/BTGINTT3.sav"))

かなり処理時間が掛かる 610.987733 seconds 別の機会にやったら 2008.776553 seconds。これは，どげんかせんといかん
153,682 rows × 493 columns だったが？

長時間がかかったので，csv ファイルに書きだしておく。

@time CSV.write("SPSS_2018_international/BTGINTT3.csv", df)
# 35.790314 seconds
@time df = CSV.read("SPSS_2018_international/BTGINTT3.csv", DataFrame)
# 24.544577 seconds 毎回 *.sav ファイルを読むなどと言う愚行は避けるべし

CNTRY 別のグループデータフレームにする

@time gdf = groupby(df, :CNTRY)
# 0.134826 seconds

国別の行数

combine(gdf, nrow)

日本の行数

using FreqTables
freqtable(df[!, :CNTRY])["JPN"] # 3555

日本のデータを取り出す
クエリーで直接データフレームを処理することもできるが，

using Query
@time jpn = df |> @filter(_.CNTRY == "JPN") |> DataFrame
# 86.142237 seconds 遅い！！

TT3G34M への回答状況（日本）

freqtable(jpn[!, :TT3G34M])
#===
5-element Named Vector{Int64}
Dim1 │
────────┼─────
1.0 │ 835
2.0 │ 1478
3.0 │ 784
4.0 │ 428
missing │ 30
===#

groupby() に引き続いての処理時間でも，こちらの方が速い

@time jpn = gdf[(CNTRY="JPN",)];
# 0.468942 seconds

TT3G34M への回答状況（日本）

freqtable(jpn[!, :TT3G34M])
#===
5-element Named Vector{Int64}
Dim1 │
────────┼─────
1.0 │ 835
2.0 │ 1478
3.0 │ 784
4.0 │ 428
missing │ 30
===#

TT3G34M への回答状況（各国）

freqtable(df[!, :CNTRY], df[!, :TT3G34M])
#===
47×5 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 1.0 2.0 3.0 4.0 missing
────────────┼────────────────────────────────────────────
ABA │ 143 595 691 613 57
ARE │ 135 882 2227 5170 234
AUS │ 39 755 1291 1155 333
AUT │ 302 1581 1242 1006 124
BEL │ 688 1441 1584 1343 201
⋮ ⋮ ⋮ ⋮ ⋮ ⋮
VNM │ 186 1227 1483 920 9
ZAF │ 538 493 441 527 47
===#

国ごとの平均値（欠損値を除いて）

using Statistics
df2 = df[completecases(df, [:CNTRY, :TT3G34M]), :]
gdf2 = groupby(df2, :CNTRY)
result = combine(gdf2, nrow, :TT3G34M => mean)
#===
46 rows × 3 columns

CNTRY nrow TT3G34M_mean
String Int64 Float64
1 ABA 2042 2.86876
2 ARE 8414 3.47754
3 AUS 3240 3.09938
4 AUT 4131 2.7146
5 BEL 5056 2.70847
6 BGR 2821 3.00425
7 BRA 2371 2.90468
8 CAB 974 3.19713
9 CHL 1930 3.1658
10 COL 2354 3.18182
:
===#

result2 = sort(result, :TT3G34M_mean)
#===
46 rows × 3 columns

CNTRY nrow TT3G34M_mean
String Int64 Float64
1 JPN 3525 2.22837
2 ZAF 1999 2.47874
3 FRA 2846 2.47892
4 EST 2930 2.66689
5 TWN 3811 2.70323
6 BEL 5056 2.70847
7 SVN 2046 2.71017
8 AUT 4131 2.7146
9 HRV 3269 2.72316
10 ISR 2349 2.73052
:
===#

図を描く

using Plots
pyplot()
scatter(result2[!, :TT3G34M_mean], result2[!, :CNTRY],
tick_direction=:out, size=(500, 600), xlims=(1, 4), label="")
countryname = CSV.read("countryname.csv", DataFrame);
name = fill("", length(result2[!, :CNTRY]));
for (i, abb3) in enumerate(result2[!, :CNTRY])
j = indexin([abb3], countryname[!, :abb3])[1]
name[i] = isnothing(j) ? abb3 : countryname[j, :nameJ]
end
yticks!(0.5:length(name), name)
savefig("fig.png")

頑張ってます。クリックお願いします。

Julia に翻訳--237　モンテカルロ法でeを求める

2021年05月22日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

モンテカルロ法でeを求める
https://oku.edu.mie-u.ac.jp/~okumura/python/e-montecarlo.html

ファイル名: montecarlo.jl　　関数名:

翻訳するときに書いたメモ

==========#
#=====
一つの関数にしてJIT化する：
@jit(nopython=True)
def main():
def ne():
s = 0
i = 0
while s < 1:
s += random.random()
i += 1
return i
n = 100000000
t = 0
for _ in range(n):
t += ne()
print(t / n)

%time main()

2.23s になった。
=====#

システムはとても古くて，M1 Mac mini には及びもつかないが

macOS Catalina
vergion 10.15.7
MacBook Pro(Retina, mid 2012)
processor 2.7 GHz quadcore Intel Core i7
mamory 16GB 1600 MHz DDr3

Julia だと,Python とほとんど同じコーディングで

using Statistics

function test()
function ne()
s = 0
i = 0
while s < 1
s += rand()
i += 1
end
i
end
n = 100000000
t = 0
for i = 0:n-1
t += ne()
end
println(mean(t)) # 2.71824778
end

@time test()
#==
2.71822072e8
3.672514 seconds (36 allocations: 1.109 KiB)
==#

いい勝負ですよ。お金を掛けなくても，Julia に乗り換えるだけで，M1 Mac に太刀打ちできる？？？？

頑張ってます。クリックお願いします。

Julia に翻訳--236　十進多倍長計算

2021年05月22日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

十進多倍長計算
https://oku.edu.mie-u.ac.jp/~okumura/python/decimal.html

ファイル名: 十進多倍長計算.jl　　関数名:

翻訳するときに書いたメモ

==========#

#> Excelに =0.3-0.2-0.1 と打ち込んでみてください。当然ながら（？）ぴったりゼロになります。

0.3-0.2-0.1　# -2.7755575615628914e-17

#> Excelの計算結果としては，この -2.77556E-17 のほうが正しいのです

#> AppleのNumbersという表計算ソフトの最新版では =74724506/23785549-PI() と打ち込むとぴったり -2.22712210211597E-15 になります

#> Pythonでは decimal というパッケージを使えば任意精度の10進演算ができます。
#> 中略
#> 74724506 / 23785549 - π の値を求めてみましょう：
#> 中略
#> 結果が -2.22712210211597472...E-15 のように出力されます。

Julia では

74724506/23785549-π　# -2.220446049250313e-15　？？

以下のようにすれば
big(74724506) / big(23785549) - π # -2.227122102115974720471985452108342639006776121601087390928614849464470859988176e-15

#> おまけ）1 / 0.9899 を計算して，小数点以下を2桁ずつ区切って読んでみてください。フィボナッチ数列になっています
（10000/9899 を計算するとなぜ商 (1.0102030508) にフィボナッチ数列が現れるのですか？）\https://jp.quora.com/10000-9899-%E3%82%92%E8%A8%88%E7%AE%97%E3%81%99%E3%82%8B%E3%81%A8%E3%81%AA%E3%81%9C%E5%95%86-1-0102030508-%E3%81%AB%E3%83%95%E3%82%A3%E3%83%9C%E3%83%8A%E3%83%83%E3%83%81%E6%95%B0%E5%88%97%E3%81%8C%E7%8F%BE%E3%82%8C

1 / big"0.9899" # 1.01020305081321345590463683200323264976260228305889483786241034447924032730578

1.01 02 03 05 08 13 21 34 55 90463683200323264976260228305889483786241034447924032730578

頑張ってます。クリックお願いします。

Julia に翻訳--235　PISA「盗難事件」問題のシミュレーション　二項分布

2021年05月22日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

PISA「盗難事件」問題のシミュレーション
https://oku.edu.mie-u.ac.jp/~okumura/python/pisa-robbr.html

ファイル名: PISA.jl　　関数名:

翻訳するときに書いたメモ

==========#

#> 全部で1024件の盗難事件が1998年・1999年の2年間に起こった場合，どちらの年も盗難の起きやすさは同じと仮定して（帰無仮説），両者の件数が8件以上異なる確率（p 値）をシミュレーションで求める。

#> import numpy as np

#> rng = np.random.default_rng()
#> a = rng.integers(0, 2, size=1024)

#> 両年の件数の差は 2 * abs(sum(a) - 512) である。これが8以上なら True，8未満なら False を返す関数を作ってみる。

#> def f():
#> a = rng.integers(0, 2, size=1024)
#> return 2 * abs(sum(a) - 512) >= 8

#> これの返す値を要素とする長さ10000の配列を作り，True の割合を調べる。

#> b = [f() for _ in range(10000)]
#> sum(b) / 10000

なにか，難しそう。

#> rng = np.random.default_rng()
#> a = rng.integers(0, 2, size=1024)

って？

要するに，ランダムな 0/1 の二値データを 1024 個発生させる？0/1 の差は 2 * abs(sum(a) - 512)

1024 個のランダムな 0/1 二値データ（0 も 1 も同じ確率である）は，

a = rand(0:1, 1024)

で生成される。

1 であるのは

sum(a) # sum(a .== 1) でもあるが

0 であるのは

1024 - sum(a) # sum(a .== 0)

この差は

sum(a) - (1024 - sum(a))

すなわち

2 * sum(a) - 1024

絶対値を取ると

abs(2 * sum(a) - 1024)

すなわち

2 * abs(sum(a) - 512)

> これが8以上なら True，8未満なら False を返す関数

というのも，なぜ 8 以上か未満かというのもいきなりなので，まずは 2 * abs(sum(a) - 512) の分布を調べてみよう。

1024 個のランダムな 0/1 データで，0 と 1 の個数の差の絶対値は，

f() = 2 * abs(sum(rand(0:1, 1024)) - 512) # a = rand(0:1, 1024)

これを 10000 回調べてみよう。

using Random; Random.seed!(123)
results = [f() for i = 1:10000];

results の分布は，たとえば

maxx = maximum(results)

0〜maxxの範囲にある（偶数しかとらないが）。

集計すると

frequency = zeros(maxx+1);
[frequency[i+1] += 1 for i in results];

using Plots
bar(frequency, label="")
vline!([8], label="")
savefig("fig.png")

まあ，差がぴったり0というのはさすがにそう多くはないが差が 20 以内というのはけっこうある。
では，差が 8 以上というのがどれくらいあるかというと

]frequency[1:8]
sum(frequency[1:8]) # 1693.0
sum(results .<= 6) # 1693
sum(results .>= 8) # 8307

今の場合は 8307 回であった。つまり 10000 回のうちの 83.07% ということで，「差が 8 以上」ということは，よくあることということになる。

#> この値を p 値という。

頑張ってます。クリックお願いします。

Julia に翻訳--234　メキシカンハット

2021年05月21日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

メキシカンハット
https://oku.edu.mie-u.ac.jp/~okumura/python/mexican.html

ファイル名: mexicanhat.jl　　関数名:

翻訳するときに書いたメモ

バックエンドにより，結果が若干異なる（インタラクティブとか設定とか）

==========#

using Plots
pyplot()

function hat(x, y)
r = @. x ^ 2 + y' ^ 2
@. exp(-r / 2) * (1 - r / 2) / π
end

n = 51
x = range(-4, 4, length=n);
y = range(-4, 4, length=n);
z = hat(x, y);

#> 等高線を描くには plt.contour を使います：

plt1 = contour(x, y, z)

その他3種

plt2 = wireframe(x, y, z)
plt3 = surface(x, y, z)
i = 26
plt4 = plot(x, z[i, :], label="")
for i = 27:51
plot!(x, z[i, :], label="")
end
plot(plt1, plt2, plt3, plt4)
savefig("fig2.png")

グリグリできる

plotly()
surface(x, y, z, alpha=0.7)

頑張ってます。クリックお願いします。

Julia に翻訳--233　ROC曲線とPR曲線

2021年05月21日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

ROC曲線とPR曲線
https://oku.edu.mie-u.ac.jp/~okumura/python/roc-pr.html

ファイル名: roccurve.jl　　関数名:

翻訳するときに書いたメモ

for を使わず（リスト内包表記も使わず）ベクトル演算するようにすれば，短く簡潔に書ける。
Julia だと，ドット演算子記法なので，R に比べると煩わしい。

==========#

#> ROC曲線とPR曲線

#> ROC曲線をPythonで描いてみよう。まずデータ：

using Printf

s = [16,15,14,13,12,11,10, 9, 8, 8, 8, 8, 7, 6, 5]
t = [ 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]

#> 真陽性率と偽陽性率を求める簡単な関数：

#> TP / (TP + FN) = sensitivity = recall
tpr(x) = sum(t[s .>= x]) / sum(t) # 一行で関数定義

#> FP / (FP + TN) = 1 - specificity
fpr(x) = sum(t[s .>= x] .== 0) / sum(t .== 0)

#> スコアに Inf をアペンドして，ソートしユニークなものだけ選ぶ：

u = sort(unique(float.(s)))
append!(u, Inf)

#> グラフを描く：

using Plots
pyplot( # 共通するパラメータはここで定義するとよい
grid=false,
aspect_ratio=1, # アスペクト比を1にする
xlims=(-0.05, 1.05),
ylims=(-0.05, 1.05),
marker=(:circle, 3, :red, stroke(:red)),
tick_direction=:out,
label="")

ROC = plot(
fpr.(u), # ベクトル演算に対応
tpr.(u),
xlabel = "False Positive Rate",
ylabel = "True Positive Rate")

#> どの点がどの閾値に対応するかも書き込みたければ次のようにする：

label = [@sprintf(" %g", x) for x in u] # text() 内には入れられない
annotate!(
fpr.(u),
tpr.(u),
text.(label, 10, :left, :bottom))

#> PR曲線

#> 真陽性率は，検診では感度 sensitivity，機械学習では再現率 recall とも呼ばれる。一方，検診で陽性適中率 positive predictive value と呼ばれる次の値は，機械学習では適合率 precision と呼ばれる：

#> TP / (TP + FP) = PPV = precision
ppv(x) = sum(t[s .>= x]) / sum(s .>= x)

#> 横軸に再現率，縦軸に適合率をとったグラフを，機械学習ではPR曲線（precision-recall curve）と呼び，よく使われる。

PR = plot(
tpr.(u), # ベクトル演算に対応
ppv.(u),
xlabel="Recall",
ylabel="Precision")

2枚の図を合成して，保存する。

plot(ROC, PR, size=(600, 300))
savefig("fig1.png")

頑張ってます。クリックお願いします。

Julia に翻訳--232　ヒストグラム

2021年05月20日 | ブログラミング

#==========
Julia の修行をするときに，いろいろなプログラムを書き換えるのは有効な方法だ。
以下のプログラムを Julia に翻訳してみる。

ヒストグラム
https://oku.edu.mie-u.ac.jp/~okumura/python/hist.html

ファイル名: ヒストグラム.jl　　関数名:

翻訳するときに書いたメモ

統計学においては最も基本的なグラフの一つである。それが故，「○○で描かれるヒストグラムはかっこ悪い」，「描くのが面倒だ」という批判もされやすい。
Julia で，まずまずのヒストグラムが描けただろうか？

==========#

using Plots
pyplot(grid=false, label="")

x = randn(1234);

histogram(x, tick_direction=:out)

histogram(x, color="lightgray", tick_direction=:out)

縦軸を何にするか，Julia では 4 通りある。

normalize: Bool or Symbol. Histogram normalization mode.
Possible values are:
false/:none (no normalization, default),
true/:pdf (normalize to a discrete Probability Density Function,
where the total area of the bins is 1),
:probability (bin heights sum to 1) and
:density (the area of each bin, rather than the height,
is equal to the counts - useful for uneven bin sizes).

histogram(x, tick_direction=:out, bg=:bisque, grid=true,
fg_color_grid=:green, ylabel="Frequency")

histogram(x, normalize=:pdf, ylabel="Pdf", tick_direction=:out,
bg=:bisque, grid=true, fg_color_grid=:green)
savefig("histogram.png")

histogram(x, normalize=:probability, ylabel="Probability",   　　　
          tick_direction=:out,
        bg=:bisque, grid=true, fg_color_grid=:green)

頑張ってます。クリックお願いします。

PVアクセスランキングにほんブログ村

プロフィール

フォロー中フォローするフォローする

バックナンバー

2024年07月

2024年06月

2024年05月

2024年04月

2024年03月

2024年02月

2024年01月

2023年12月

2023年11月

2023年10月

2023年09月

2023年08月

2023年07月

2023年06月

2023年05月

2023年04月

2023年03月

2023年02月

2023年01月

2022年12月

2022年11月

2022年10月

2022年09月

2022年08月

2022年07月

2022年06月

2022年05月

2022年04月

2022年03月

2022年02月

2022年01月

2021年12月

2021年11月

2021年10月

2021年09月

2021年08月

2021年07月

2021年06月

2021年05月

2021年04月

2021年03月

2021年02月

2021年01月

2020年12月

2020年11月

2020年10月

2020年09月

2020年08月

2020年07月

2020年06月

2020年05月

2020年04月

2020年03月

2020年02月

2020年01月

2019年12月

2019年11月

2019年09月

2019年08月

2019年07月

2019年06月

2019年05月

2019年04月

2018年12月

2018年09月

2018年08月

2018年07月

2018年06月

2018年05月

2018年03月

2018年02月

2018年01月

2017年12月

2017年11月

2017年10月

2017年09月

2017年08月

2017年07月

2017年06月

2017年05月

2017年04月

2017年03月

2017年02月

2017年01月

2016年12月

2016年11月

2016年10月

2016年09月

2016年08月

2016年07月

2016年06月

2016年05月

2016年04月

2016年03月

2016年02月

2016年01月

2015年12月

2015年11月

2015年10月

2015年09月

2015年08月

2015年07月

2015年06月

2015年05月

2015年04月

2015年03月

2015年02月

2015年01月

2014年12月

2014年11月

2014年10月

2014年09月

2014年08月

2014年07月

2014年06月

2014年05月

2014年04月

2014年03月

2014年02月

2014年01月

2013年12月

2013年11月

2013年10月

2013年09月

2013年08月

2013年07月

2013年06月

2013年05月

2013年04月

2013年03月

2013年02月

2013年01月

2012年12月

2012年11月

2012年10月

2012年09月

2012年08月

2012年07月

2012年06月

2012年05月

2012年04月

2012年03月

2012年02月

2012年01月

2011年12月

2011年11月

2011年10月

2011年09月

2011年07月

2011年06月

2011年05月

2011年04月

2011年03月

2011年02月

2011年01月

2010年09月

2010年08月

2010年07月

カレンダー

2021年5月
日	月	火	水	木	金	土
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

前月

次月

goo blog お知らせ

	ブログを読むだけ。毎月の訪問日数に応じてポイント進呈
	gooブロガーの今日のひとこと
	訪問者数に応じてdポイント最大1,000pt当たる！
	goo blogは20周年を迎えました！