2022年2月のブログ記事一覧(2ページ目)-裏 RjpWiki

Julia: HypothesisTests.PowerDivergenceTest() の問題点

2022年02月04日 | ブログラミング

Julia の HypothesisTests.PowerDivergenceTest() は問題がある。

対象とする分割表に，度数が 0 のセルがあると不適切な（役に立たない）結果を返す。

julia> x = [4 5 2 0; 0 7 6 1; 1 0 3 1]
3×4 Matrix{Int64}:
4 5 2 0
0 7 6 1
1 0 3 1

julia> using HypothesisTests

julia> result = PowerDivergenceTest(x, lambda=0.0);

julia> println(result.stat)
NaN

julia> println(pvalue(result))
NaN

そこで，以下のような関数を書く。

julia> function PDTsummary(result::PowerDivergenceTest; resid::Bool=false)
# 結果を簡潔に表示するための関数
function printmatrix(name, x)
println("\n$name")
Base.print_matrix(stdout, round.(x, digits=5), "", " ", "\n")
end
println("chisq. = $(round(result.stat, digits=5)), df = $(result.df), p.value = $(round(pvalue(result), digits=5))")
if resid
printmatrix("residuals", result.residuals)
printmatrix("standardized residuals", result.stdresiduals)
end
end
PDTsummary (generic function with 1 method)

julia> function g2stat(x, nrows, ncols, n, correct)
# セルに 0 があっても正しい答えを出す
# correct=true で，連続性の補正も行うことができる
ln(n) = n .== 0 ? 0 : n .* log.(n)
n = sum(x) # 全サンプルサイズ
n1 = sum(x, dims=2) # 行和
n2 = sum(x, dims=1) # 列和
G2 = 2*(sum(ln.(x)) - sum(ln.(n1)) - sum(ln.(n2)) + ln(n)) # G 統計量
correct && (G2 /= 1 + (n * sum(1 ./ n1) - 1) * (n * sum(1 ./ n2) - 1) / (6n * nrows * ncols)) # 連続性の補正
df = (nrows - 1) * (ncols - 1) # G の自由度
G2, df
end
g2stat (generic function with 1 method)

julia> function chisq_test(x::AbstractMatrix{T}; correct::Bool=false, resid::Bool=false, lambda::Float64=1.0) where T<:Integer
# 分割表を入力してχ二乗検定，G2 検定
nrows, ncols = size(x)
n = sum(x)
if lambda == 1.0 && correct && nrows == 2 && ncols == 2
a, c, b, d = x
stat = n*max(0, abs(a*d - b*c) - 0.5n)^2 / ((a+b)*(c+d)*(a+c)*(b+d))
result = PowerDivergenceTest(lambda, ones(4), stat, 1, x, n, ones(4), x, x, x)
elseif lambda == 1.0
result = PowerDivergenceTest(x; lambda)
elseif lambda == 0.0
G2, df = g2stat(x, nrows, ncols, n, correct)
vec = ones(nrows*ncols)
mat = reshape(vec, nrows, ncols)
result = PowerDivergenceTest(lambda, vec, G2, df, mat, n, vec, mat, mat, mat)
end
PDTsummary(result; resid)
end
chisq_test (generic function with 1 method)

julia> function G2_test(x::AbstractMatrix{T}; correct::Bool=false, resid::Bool=false) where T<:Integer
chisq_test(x; correct, resid, lambda=0.0)
end
G2_test (generic function with 1 method)

これで，

julia> G2_test(x)
chisq. = 15.36459, df = 6, p.value = 0.0176

という正しい結果を返す。

ついでに，連続性の補正もできるようになっているので，

julia> G2_test(x, correct=true)
chisq. = 13.77649, df = 6, p.value = 0.03224

のようになる。

頑張ってます。クリックお願いします。

Julia: sin22.5°、cos22.5°、tan22.5°はどんな数？

2022年02月03日 | ブログラミング

SymPy でやった。簡単だった。

頑張ってます。クリックお願いします。

Julia: Yates の連続性の補正

2022年02月03日 | ブログラミング

またまた HypothesisTests.ChisqTest() の件であるが，Julia は Yates の連続性の補正は眼中にないらしい。いろいろなページにも一切出てこない。

で，以下のように拡張した。

using HypothesisTests

function summary(result)
println("chisq. = $(round(result.stat, digits=5)), df = $(result.df), p.value = $(round(pvalue(result), digits=5))")
end

function chisq_test(x::AbstractMatrix{T}; correct::Bool=false) where T<:Integer
nrows, ncols = size(x)
if correct && nrows == 2 && ncols == 2
n = sum(x)
a, c, b, d = float.(x)
stat = n*max(0, abs(a*d - b*c) - 0.5n)^2 / ((a+b)*(c+d)*(a+c)*(b+d))
result = PowerDivergenceTest(1, ones(4), stat, 1, x, n, ones(4), x, x, x)
else
result = PowerDivergenceTest(x, lambda=1.0)
end
summary(result)
end

julia> x = [10 3; 4 12]
2×2 Matrix{Int64}:
10 3
4 12

julia> result = ChisqTest(x);

julia> println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")
chisq. = 7.743956043956044, df = 1, p.value = 0.005389260671694261

julia> chisq_test(x)
chisq. = 7.74396, df = 1, p.value = 0.00539

julia> chisq_test(x, correct=true)
chisq. = 5.80415, df = 1, p.value = 0.01599

頑張ってます。クリックお願いします。

Julia: HypothesisTests.ChisqTest() の問題点

2022年02月03日 | ブログラミング

Julia の HypothesisTests.ChisqTest() は色々問題がある。

今までもいくつか指摘したところであるが，今回は独立性の検定（χ二乗検定）に引数として 2 つのベクトルを与える場合について書く。

ChisqTest() の定義で，引数に 2 変数のベクトルを与える場合は以下の 2 つの関数が定義されている。

https://github.com/JuliaStats/HypothesisTests.jl/blob/413a901612d5ca9551fe3e28a9a0646dd4b3c98d/src/power_divergence.jl#L349-L370

function ChisqTest(x::AbstractVector{T}, y::AbstractVector{T}, levels::Levels{T}) where T<:Integer
d = counts(x, y, levels)
PowerDivergenceTest(d, lambda=1.0)
end

function ChisqTest(x::AbstractVector{T}, y::AbstractVector{T}, k::T) where T<:Integer
d = counts(x, y, k)
PowerDivergenceTest(d, lambda=1.0)
end

第 1 の問題は，引数で与えることができるのは整数ベクトルに限られるということである。

文字列ベクトルの場合も多いと思うが，一旦整数ベクトルに直してから ChisqTest() を呼ぶとか，freqtable() で集計した結果を ChisqTest() に渡すしかない。

しかし，もっと問題なのは levels とか k という意味不明の引数である。

Julia の常ではあるが，オンラインヘルプはおろか，ネット上にも説明がないし，使用例もない。

内部を見ると，集計表を作るのは counts() であるが，この関数のオンラインヘルプも，そもそも二重集計の例については触れられていない。

試行錯誤でやってみると，x, y の値のうち，第 3 引数で示した数値以下のものについて集計するようだ。
第 3 引数は，x, y の両方に適用されるので，カテゴリー数の違う変数の場合は　ChisqTest() は適用できなくなってしまう。

まず，x, y ともに 4 つのカテゴリー（1,2,3,4）を持つ場合にやってみよう。

using HypothesisTests
using FreqTables

using Random; Random.seed!(123)
x = rand(1:4, 100)
y = rand(1:4, 100)
result = ChisqTest(x, y, 4)
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = 7.673743524028154, df = 9, p.value = 0.5673295593772689

ここで，自前の関数を定義しておく。

2 つの整数ベクトルを引数に取る chisq_test() と，結果の出力をする summary() を定義しておく。

function summary(result)
println("chisq. = $(round(result.stat, digits=5)), df = $(result.df), p.value = $(round(pvalue(result), digits=5))")
end

function chisq_test(x::AbstractVector{T}, y::AbstractVector{T}) where T<:Integer
d = freqtable(x, y)
result = PowerDivergenceTest(d, lambda=1.0)
summary(result)
end

chisq_test() の結果は，ChisqTest() と同じになる。

chisq_test(x, y)

chisq. = 7.67374, df = 9, p.value = 0.56733

次に，第 1 引数はカテゴリー数が 3 個，第 2 引数はカテゴリー数が 4 個になる整数ベクトルを与える場合を検討する。

まず，正しい結果を，ChisqTest() に freqtables() で得られた集計表を与えることで求める。

using Random; Random.seed!(456)
x = rand(1:3, 100)
y = rand(1:4, 100)
result = ChisqTest(freqtable(x, y))
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = 5.209486804755349, df = 6, p.value = 0.5172394227637924

次に，ChisqTest() に 2 つのベクトルを与える場合を見てみる。第 3 引数に 3, 4 のいずれを与えても（もちろん，その他の値を与えても）正しい答えは得られない。

自由度（df）の数値を見ると，(第 3 引数の数値 - 1)^2 なので，PowerDivergenceTest() を呼出すときの集計表 d = counts(x, y, levels) は levels × levels の正方行列になっていることがわかる。当然，それは不適切である。

result = ChisqTest(x, y, 3)
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = 3.4795508838987104, df = 4, p.value = 0.480994482228378

result = ChisqTest(x, y, 4)
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = NaN, df = 9, p.value = NaN

chisq_test() は正しい結果を返す。

関数定義は以下の通り。

function chisq_test(x::AbstractVector{T}, y::AbstractVector{T}) where T<:Integer
d = freqtable(x, y)
result = PowerDivergenceTest(d, lambda=1.0)
summary(result)
end

chisq_test(x, y)

chisq. = 5.20949, df = 6, p.value = 0.51724

次に，2 個の文字列ベクトルを与える場合を見てみる。

ChisqTses() は，文字列ベクトルを扱えない。事前に freqtable() で集計した結果で呼ぶ必要がある。

x = ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No",
"No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No"]
y = ["Hi", "Lo", "Med", "Med", "Lo", "Hi", "Lo", "Hi", "Lo", "Lo",
"Med", "Med", "Lo", "Med", "Med", "Lo", "Lo", "Hi", "Lo", "Med"]
tbl = freqtable(x, y)
tbl

2×3 Named Matrix{Int64}
Dim1 ╲ Dim2 │ Hi Lo Med
────────────┼──────────────
No │ 1 5 5
Yes │ 3 4 2

result = ChisqTest(tbl)
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = 2.2190155523488855, df = 2, p.value = 0.32972121777778757

chisq_test(x, y) は問題無く動く。

関数定義は以下の通り。

function chisq_test(x::AbstractVector{T}, y::AbstractVector{T}) where T<:String
d = freqtable(x, y)
result = PowerDivergenceTest(d, lambda=1.0)
summary(result)
end

chisq_test(x, y)

chisq. = 2.21902, df = 2, p.value = 0.32972

どちらかが整数ベクトル，もう一方が文字列ベクトルの場合も，以下の関数定義でカバーできる。

function chisq_test(x::AbstractVector{T}, y::AbstractVector{U}) where {T<:String,U<:Integer}
d = freqtable(x, y)
result = PowerDivergenceTest(d, lambda=1.0)
summary(result)
end

function chisq_test(x::AbstractVector{T}, theta0::Vector{U} = ones(length(x))/length(x)) where {T<:Integer,U<:AbstractFloat}
result = PowerDivergenceTest(reshape(x, length(x), 1), lambda=1.0, theta0=theta0)
summary(result)
end

using Random; Random.seed!(789)
x = rand(["foo", "bar", "baz"], 100)
y = rand(1:4, 100)

result = ChisqTest(freqtable(x, y))
println("chisq. = $(result.stat), df = $(result.df), p.value = $(pvalue(result))")

chisq. = 8.798699647157212, df = 6, p.value = 0.18521956943031795

chisq_test(x, y)

chisq. = 8.7987, df = 6, p.value = 0.18522

chisq_test(y, x)

chisq. = 8.7987, df = 6, p.value = 0.18522

2 つのベクトルを与える chisq_test() を 4 通り定義したのだが，1 つの定義だけで済むはず。

頑張ってます。クリックお願いします。

Julia: String3, String7 って，何？

2022年02月03日 | ブログラミング

それが起こったのは，たぶん 2022 年 1 月末のある日。

以前にはなんのエラーもなく実行できていたプログラムでエラーが生じる。FreqTables.freqtable() で作成したクロス集計表の列や行を抽出できなくなっていた。

再現すると以下のようである。

まず以下のようなデータを読む。

io = IOBuffer("""
ID,Weight,Height,Age,BloodType,Gender
125,44.6,157.6,17,A,Male
321,57.3,158.8,26,B,Male
437,54.3,162.2,22,AB,Female
426,45.4,160.8,31,O,Male
243,48.5,174.5,34,B,Female
""")

CSV.read() の第 1 引数は IOBuffer() の戻り値をとることができる。

using CSV, DataFrames
df = CSV.read(io, DataFrame)

もう気づいてしまった人もいるかもしれないが，先に進む。

df.Gender と df.BloodType のクロス集計表を作る。

using FreqTables
tbl = freqtable(df.Gender, df.BloodType)

2×4 Named Matrix{Int64}
Dim1 ╲ Dim2 │ "A" "AB" "B" "O"
────────────┼───────────────────────
"Female" │ 0 1 1 0
"Male" │ 1 0 1 1

集計表 tbl の 1 行目，2 列目を取り出してみる。

以下はいずれもエラーになる。
MethodError: no method matching indices(::OrderedCollections.OrderedDict{String7, Int64}, ::String)

tbl["Female", :]
tbl[:, "AB"]

以下はなんの問題もない。

tbl[1, :]

4-element Named Vector{Int64}
Dim2 │
──────┼──
"A" │ 0
"AB" │ 1
"B" │ 1
"O" │ 0

tbl[:, 2]

2-element Named Vector{Int64}
Dim1 │
─────────┼──
"Female" │ 1
"Male" │ 0

あれこれやって，半日が過ぎた（大げさ）。

幸い，以前実行した結果が残っているので，ダメ元で今の出力結果と比較した。

気がついた。

BloodType と Gender の型が String3 と String7 になっている。

今までは両方とも String だった。

BloodType と Gender の型を String に変換しよう。

df[!, :BloodType] = string.(df.BloodType)
df[!, :Gender] = string.(df.Gender)
df

これでもう一度集計表を作り，行と列を取り出してみよう。

tbl = freqtable(df.Gender, df.BloodType)

2×4 Named Matrix{Int64}
Dim1 ╲ Dim2 │ A AB B O
────────────┼───────────────
Female │ 0 1 1 0
Male │ 1 0 1 1

tbl["Female", :]

4-element Named Vector{Int64}
Dim2 │
──────┼──
A │ 0
AB │ 1
B │ 1
O │ 0

tbl[1, :]

4-element Named Vector{Int64}
Dim2 │
──────┼──
A │ 0
AB │ 1
B │ 1
O │ 0

tbl[:, "AB"]

2-element Named Vector{Int64}
Dim1 │
───────┼──
Female │ 1
Male │ 0

tbl[:, 2]

2-element Named Vector{Int64}
Dim1 │
───────┼──
Female │ 1
Male │ 0

前と同じようにちゃんと動いている。

そもそも String3 と String7 とはなんぞやと，ぐぐってみると以下のようなことがわかった。

・固定長文字列(Fixed width strings) ということで，String1, String3, String7, ..., String(2^8-1) がある。
・かなり前からあるが，CSV.jl 0.9.1 ではすでに取り込まれていた。2022/02/03 では CSV v0.10.2 だ。
・パフォーマンスが優れている。
・まだサポートしていないパッケージがある。まさに freqtable() がそうだった。

そのうちに freqtable() も固定長文字列に対応するのだろうか。

頑張ってます。クリックお願いします。

Julia で統計解析--その6　数値データの可視化

2022年02月02日 | 統計学

これらの文書群は github で管理することとした

最新バージョン 2022-02-02 17:05

以下を参照のこと

https://r-de-r.github.io/stats/Julia-stats6.html

1.　数値データの可視化
　1.1.　ヒストグラム
　　1.1.1.　一標本の場合
　　1.3.2.　複数標本の場合
　1.4.　ボックスプロット（箱ひげ図）
　1.5.　バイオリンプロット
　1.6.　ドットプロット
　1.7.　カーネル密度推定の描画
　1.8.　Q-Q プロット
　1.9.　散布図
　1.10.　カーネル密度推定
　1.11.　散布図行列

頑張ってます。クリックお願いします。

Julia で統計解析--その5　離散データの可視化

2022年02月02日 | 統計学

これらの文書群は github で管理することとした

最新バージョン 2022-02-02 16:54

以下を参照のこと

https://r-de-r.github.io/stats/Julia-stats5.html

1.　離散データの可視化
　1.1.　例示に使用するデータセット
　　1.1.1.　カテゴリーデータ
　1.2.　棒グラフ
　　1.2.1.　一標本の場合
　　1.2.2.　二標本以上の場合
　　　1.2.2.1.　横に並べる棒グラフ
　　　1.2.2.2.　積み上げ棒グラフ
　　1.2.3.　複数のグラフを行列状にまとめて表示する方法
　1.3.　帯グラフ
　1.4.　モザイクプロット
　1.5.　バルーンプロット

頑張ってます。クリックお願いします。

Julia で統計解析--その4　集計

2022年02月02日 | 統計学

これらの文書群は github で管理することとした

最新バージョン 2022-02-02 16:51

以下を参照のこと

https://r-de-r.github.io/stats/Julia-stats4.html

1.　一変量統計
　1.1.　一変数の場合
　　1.1.1.　基礎統計量
　　1.1.2.　パーセンタイル値
　　1.1.3.　度数分布
　1.2.　複数の変数の場合
　　1.2.1.　eachcol() を使う
　　1.2.2.　describe() を使う
　　1.2.3.　combine(), select()/select!(), transform()/transform!() を使う
　1.3.　グループごとの記述統計量
　　1.3.1.　describe() を使う
　　1.3.2.　変数ごとに統計量を一覧表示
2.　二変量統計
　2.1.　二重クロス集計表
　2.2.　相関係数，共分散
　　2.2.1.　欠損値を含まない場合
　　　2.2.1.1.　それぞれの関数を使う
　　　2.2.1.2.　combine() を使う
　　2.2.2.　欠損値を含む場合
　　　2.2.2.1.　それぞれの関数を使う
　　　2.2.2.2.　combine() を使う
　2.3.　相関係数行列，分散・共分散行列
　　2.3.1.　欠損値を含まない場合
　　2.3.2.　欠損値を含む場合
3.　多変量統計
　3.1. 多重クロス集計表

頑張ってます。クリックお願いします。

PVアクセスランキングにほんブログ村

プロフィール

フォロー中フォローするフォローする

バックナンバー

2024年10月

2024年09月

2024年08月

2024年07月

2024年06月

2024年05月

2024年04月

2024年03月

2024年02月

2024年01月

2023年12月

2023年11月

2023年10月

2023年09月

2023年08月

2023年07月

2023年06月

2023年05月

2023年04月

2023年03月

2023年02月

2023年01月

2022年12月

2022年11月

2022年10月

2022年09月

2022年08月

2022年07月

2022年06月

2022年05月

2022年04月

2022年03月

2022年02月

2022年01月

2021年12月

2021年11月

2021年10月

2021年09月

2021年08月

2021年07月

2021年06月

2021年05月

2021年04月

2021年03月

2021年02月

2021年01月

2020年12月

2020年11月

2020年10月

2020年09月

2020年08月

2020年07月

2020年06月

2020年05月

2020年04月

2020年03月

2020年02月

2020年01月

2019年12月

2019年11月

2019年09月

2019年08月

2019年07月

2019年06月

2019年05月

2019年04月

2018年12月

2018年09月

2018年08月

2018年07月

2018年06月

2018年05月

2018年03月

2018年02月

2018年01月

2017年12月

2017年11月

2017年10月

2017年09月

2017年08月

2017年07月

2017年06月

2017年05月

2017年04月

2017年03月

2017年02月

2017年01月

2016年12月

2016年11月

2016年10月

2016年09月

2016年08月

2016年07月

2016年06月

2016年05月

2016年04月

2016年03月

2016年02月

2016年01月

2015年12月

2015年11月

2015年10月

2015年09月

2015年08月

2015年07月

2015年06月

2015年05月

2015年04月

2015年03月

2015年02月

2015年01月

2014年12月

2014年11月

2014年10月

2014年09月

2014年08月

2014年07月

2014年06月

2014年05月

2014年04月

2014年03月

2014年02月

2014年01月

2013年12月

2013年11月

2013年10月

2013年09月

2013年08月

2013年07月

2013年06月

2013年05月

2013年04月

2013年03月

2013年02月

2013年01月

2012年12月

2012年11月

2012年10月

2012年09月

2012年08月

2012年07月

2012年06月

2012年05月

2012年04月

2012年03月

2012年02月

2012年01月

2011年12月

2011年11月

2011年10月

2011年09月

2011年07月

2011年06月

2011年05月

2011年04月

2011年03月

2011年02月

2011年01月

2010年09月

2010年08月

2010年07月

カレンダー

2022年2月
日	月	火	水	木	金	土
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28

前月

次月

goo blog お知らせ

	【gooブロガー・先着】dアカウント連携でdポイント2,000pt
	ブログを読むだけ。毎月の訪問日数に応じてポイント進呈
	【コメント募集中】goo blogスタッフの気になったニュース
	gooブロガーの今日のひとこと
	訪問者数に応じてdポイント最大1,000pt当たる！