Vector search in PostgreSQL

2025-02-08 17:08:52 | PostgreSQL

Notes on running vector searches in PostgreSQL.

Creating the table and inserting data

As the postgres user, enable the vector extension in the test_vector1 database.

$ psql -h 127.0.0.1 -U postgres
# create database test_vector1;
# \connect test_vector1
# create extension vector;

Create the table and add a constraint on the vector dimension

# create table vector1 (
    id      integer not null,
    name    varchar(256) not null,
    vec     vector(2),
    primary key (id)
);

# alter table vector1 add check (vector_dims(vec::vector) = 2);

Insert the data

# insert into vector1 (id, name, vec) values (0, 'name_000', array[0.0, 0.0]);
# insert into vector1 (id, name, vec) values (1, 'name_001', array[0.0, 1.0]);
# insert into vector1 (id, name, vec) values (2, 'name_002', array[1.0, 0.0]);
# insert into vector1 (id, name, vec) values (3, 'name_003', array[2.0, 1.0]);
# insert into vector1 (id, name, vec) values (4, 'name_004', array[1.0, 2.0]);
# insert into vector1 (id, name, vec) values (5, 'name_005', array[2.0, 2.0]);
# insert into vector1 (id, name, vec) values (6, 'name_006', array[3.0, 1.0]);
# insert into vector1 (id, name, vec) values (7, 'name_007', array[3.0, 2.0]);
# insert into vector1 (id, name, vec) values (8, 'name_008', array[1.0, 3.0]);
# insert into vector1 (id, name, vec) values (9, 'name_009', array[2.0, 3.0]);

Search

<->: search by Euclidean (L2) distance

# select * from vector1 order by vec <-> '[1.5, 1.5]' limit 3;

 id |   name   |  vec
----+----------+-------
  3 | name_003 | [2,1]
  4 | name_004 | [1,2]
  5 | name_005 | [2,2]

<=>: search by cosine distance

# select * from vector1 order by vec <=> '[1.5, 1.5]' limit 3;

 id |   name   |  vec
----+----------+-------
  5 | name_005 | [2,2]
  7 | name_007 | [3,2]
  9 | name_009 | [2,3]

<#>: search by negative inner product (inner product * -1)

# select * from vector1 order by vec <#> '[1.5, 1.5]' limit 3;

 id |   name   |  vec
----+----------+-------
  7 | name_007 | [3,2]
  9 | name_009 | [2,3]
  6 | name_006 | [3,1]

Search with a condition

# select * from vector1 where id % 2 = 0 order by vec <-> '[1.5, 1.5]' limit 3;

 id |   name   |  vec
----+----------+-------
  4 | name_004 | [1,2]
  2 | name_002 | [1,0]
  6 | name_006 | [3,1]
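
To see what each operator orders by, the four queries above can be reproduced in plain Python. This is only a sketch: pgvector is not involved, the names ROWS, search and the three distance functions are made up for illustration, and sorting zero vectors last for cosine distance is an assumption of this sketch rather than documented pgvector behavior.

```python
import math

# The ten rows inserted into vector1 above, as id -> vec.
ROWS = {
    0: (0.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 0.0), 3: (2.0, 1.0),
    4: (1.0, 2.0), 5: (2.0, 2.0), 6: (3.0, 1.0), 7: (3.0, 2.0),
    8: (1.0, 3.0), 9: (2.0, 3.0),
}


def l2_distance(a, b):
    """Euclidean distance: what `<->` orders by."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: what `<=>` orders by."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0.0:
        # Assumption of this sketch: a zero vector has no direction,
        # so it is sorted last.
        return math.inf
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


def negative_inner_product(a, b):
    """Inner product * -1: what `<#>` orders by."""
    return -sum(x * y for x, y in zip(a, b))


def search(distance, query, limit=3, where=lambda row_id: True):
    """Return the ids of the `limit` nearest rows that satisfy `where`."""
    candidates = [row_id for row_id in ROWS if where(row_id)]
    return sorted(candidates, key=lambda i: distance(ROWS[i], query))[:limit]


QUERY = (1.5, 1.5)
print(search(l2_distance, QUERY))
print(search(cosine_distance, QUERY))
print(search(negative_inner_product, QUERY))
print(search(l2_distance, QUERY, where=lambda row_id: row_id % 2 == 0))
```

Several rows are tied at the same distance from [1.5, 1.5], so the order among tied rows (and, for <#>, which of the tied rows takes third place) may differ from the psql output.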

Installing pgvector for PostgreSQL

2025-02-02 13:29:41 | PostgreSQL

Notes on installing pgvector for PostgreSQL.

Installing redhat-rpm-config

If redhat-rpm-config is not installed, install it with the following command.

sudo yum install redhat-rpm-config

Installing pgvector

Download the source

git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git

Set the environment variables

export PATH="${PATH}:/usr/pgsql-17/bin"
export PG_CONFIG=/usr/pgsql-17/bin/pg_config

Build and install

cd pgvector
make
sudo --preserve-env=PG_CONFIG make install

Verify

ls /usr/pgsql-17/lib/bitcode

vector  vector.index.bc

Reference

https://github.com/pgvector/pgvector/blob/master/README.md

Installing PostgreSQL's pgxs from an RPM

2025-02-02 13:04:00 | PostgreSQL

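Notes on installing pgxs, PostgreSQL's extension build infrastructure, from an RPM package.

When I tried to install postgresql17-devel with the following command, it failed with an error saying that perl(IPC::Run) is required.

sudo yum install postgresql17-devel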

Error:
 Problem: cannot install the best candidate for the job
  - nothing provides perl(IPC::Run) needed by postgresql17-devel-17.2-1PGDG.rhel9.x86_64 from pgdg17
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

So, following the page below, install Perl(IPC::Run) first and then install postgresql17-devel.

https://ossc-db.github.io/pg_bulkload/pg_bulkload-ja.html#install

Installing Perl(IPC::Run)

sudo yum --enablerepo=crb install perl-IPC-Run

Installing postgresql17-devel

sudo yum install postgresql17-devel

Verify

ls /usr/pgsql-17/lib/pgxs
config  src

Installing PostgreSQL 17

2025-01-19 17:57:52 | PostgreSQL

Notes on installing PostgreSQL 17 on Rocky Linux.

Add the repository

sudo yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-9-x86_64/pgdg-redhat-repo-latest.noarch.rpm

Install

sudo yum install postgresql17-server

Initialize

sudo su -
postgresql-17-setup initdb

Start

sudo su -
systemctl start postgresql-17

Stop

sudo su -
systemctl stop postgresql-17

Log in and set the password

sudo su - postgres
psql

\password
{enter the password}

\q

Logging in as another user

psql -h {host name} -p {port number} -U {role name (user name)} -d {database name}

Installing nvm on Linux and then installing Node

2024-12-07 20:26:29 | Node.js

Notes on installing nvm on Linux and then using it to install Node.

Check the nvm installation instructions on the following page.

https://github.com/nvm-sh/nvm#installing-and-updating

Run the following command.

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash

Running the command above adds the following lines to .bashrc.

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  # This loads nvm bash_completion

Open a new terminal or run source ~/.bashrc.

To install Node version 20, run the following.

nvm install 20

You can check the installed Node version with the following command.

node --version

v20.18.1

Multi-line strings and value embedding in BigQuery

2024-11-09 13:02:03 | BigQuery

Notes on using strings that contain line breaks in BigQuery and on embedding values in strings.

Strings containing line breaks

Enclosing a string in ''' lets it contain line breaks.

select '''
abc
def
ghi
''';

Result

abc
def
ghi

Embedding values in a string

format() can be used to embed values in a string.

select format('''
string: "%s"
integer: "%d"
float: "%f"
''', 'abc', 123, 123.456);

Result

string: "abc"
integer: "123"
float: "123.456000"
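
BigQuery's format() uses printf-style specifiers, and Python's % operator accepts the same %s, %d and %f, so the layout of the result can be previewed locally. The snippet below is an analogy in Python, not BigQuery code, and the variable names are only for illustration.

```python
# A triple-quoted template, laid out like the BigQuery example.
template = '''
string: "%s"
integer: "%d"
float: "%f"
'''

# %f prints six digits after the decimal point by default.
result = template % ('abc', 123, 123.456)
print(result)
```

Note that the template starts and ends with a line break, so the formatted string does too.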

Normalizing strings in BigQuery

2024-11-03 11:51:07 | BigQuery

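Notes on normalizing strings in BigQuery.

The following normalizations are covered here.

Full-width alphanumerics and symbols ⇒ half-width, half-width kana ⇒ full-width, etc.
Uppercase letters ⇒ lowercase
Control characters (line breaks, tabs, etc.) ⇒ a half-width space
Consecutive spaces ⇒ a single space
Removal of leading and trailing spaces

Full-width alphanumerics and symbols ⇒ half-width, half-width kana ⇒ full-width, etc.

Normalize with normalize(string, NFKC).

Example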

select normalize("ＡＢＣ０１２！＃＄ｶﾅ", NFKC);
ABC012!#$カナ

Uppercase letters ⇒ lowercase

lower(string) converts uppercase letters to lowercase.

Example

select lower("ＡＢＣABC");
ａｂｃabc

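Control characters (line breaks, tabs, etc.) ⇒ a half-width space

Replace control characters with a half-width space using a regular expression.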

select regexp_replace('a\n\r\tb', r'[\x00-\x1f\x7f]', ' ');
a   b

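To check the result, convert each character to its code point with to_code_points(string), convert the code points to bytes with code_points_to_bytes(code_points), and then convert the bytes to hexadecimal with to_hex(bytes).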

select
  to_hex(code_points_to_bytes(to_code_points(
    regexp_replace('a\n\r\tb', r'[\x00-\x1f\x7f]', ' ')
  )));
6120202062

Since 61 is 'a', 20 is ' ', and 62 is 'b', the line breaks and the tab have been converted to half-width spaces.
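
The same check can be done locally in Python, whose re module accepts the same character class. This is a small sketch for comparison only, not BigQuery code, and the variable names are illustrative.

```python
import re

# Replace every ASCII control character (0x00-0x1f and 0x7f) with a space.
replaced = re.sub(r'[\x00-\x1f\x7f]', ' ', 'a\n\r\tb')

# Dump the bytes as hex to make the otherwise invisible characters visible.
hex_dump = replaced.encode('utf-8').hex()
print(hex_dump)
```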

Consecutive spaces ⇒ a single space

Collapse consecutive spaces into a single space using a regular expression.

select regexp_replace('a b  c   d    e', r'[ ]{2,}', ' ');
a b c d e

Removal of leading and trailing spaces

Remove unneeded leading and trailing spaces using a regular expression.

select regexp_replace('  a b c  ', r'(?:^[ ]+|[ ]+$)', '');
a b c

A normalization function combining the steps above

create or replace function
  dataset.normalize_string(str string)
as (
  regexp_replace(
    regexp_replace(
      regexp_replace(
        lower(
          normalize(str, NFKC)
        )
        , r'[\x00-\x1f\x7f]', ' '
      )
      , r'[ ]{2,}', ' '
    )
    , r'(?:^[ ]+|[ ]+$)', ''
  )
);

select dataset.normalize_string(' ＡＢＣ  ｶﾅ　　＃！  ');
abc カナ #!
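
For comparison, the same five steps can be sketched in Python with the standard unicodedata and re modules. The function name simply mirrors the BigQuery one; results for unusual characters may differ between the two environments, since each relies on its own Unicode tables.

```python
import re
import unicodedata


def normalize_string(text):
    """Apply the same five steps as the BigQuery function above."""
    text = unicodedata.normalize('NFKC', text)    # full-width -> half-width, etc.
    text = text.lower()                           # uppercase -> lowercase
    text = re.sub(r'[\x00-\x1f\x7f]', ' ', text)  # control characters -> space
    text = re.sub(r' {2,}', ' ', text)            # collapse runs of spaces
    return text.strip(' ')                        # trim leading/trailing spaces


print(normalize_string(' ＡＢＣ  ｶﾅ　　＃！  '))
```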

Redirecting with a GCP load balancer

2024-10-17 23:39:39 | GCP

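Notes on configuring a redirect on a GCP load balancer (LB).

An example configuration that redirects pages under https://aaa.abc.com/ to https://www.abc.com/aaa/.

Add the following under pathRules: in the configuration of the LB for aaa.abc.com.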

- paths:
  - /*
  urlRedirect:
    hostRedirect: www.abc.com
    prefixRedirect: /aaa/

Creating a dict with only the specified keys in Python

2024-10-17 22:43:20 | python

Notes on creating a dict that contains only the specified keys of an existing dict in Python.

> a = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
> print(a)
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

> b = {k: a[k] for k in ['a', 'b', 'c']}
> print(b)
{'a': 0, 'b': 1, 'c': 2}
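
The comprehension above raises KeyError if a requested key is missing from the source dict. One possible variation, wrapped in a hypothetical helper named pick, skips missing keys instead:

```python
def pick(source, keys):
    """Return a new dict with only `keys`, skipping keys absent from `source`."""
    return {k: source[k] for k in keys if k in source}


a = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
print(pick(a, ['a', 'b', 'c']))
print(pick(a, ['a', 'z']))  # 'z' is not in a, so it is silently skipped
```

Whether a missing key should be skipped or reported depends on the use case; the original form is the better choice when a missing key indicates a bug.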

Adding and removing JSONL elements with jq

2024-10-11 09:45:37 | linux

Notes on adding and removing JSONL elements with the jq command.

JSON data

$ cat test1.jsonl

{"abc": "abc", "def": "def", "ghi": "ghi"}

Adding an element

The following adds a "jkl" element holding the value of "abc".

$ cat test1.jsonl | jq -c '. + {"jkl": .abc}'

{"abc":"abc","def":"def","ghi":"ghi","jkl":"abc"}

Removing an element

The following removes the "ghi" element.

$ cat test1.jsonl | jq -c 'del(.ghi)'

{"abc":"abc","def":"def"}

Adding and removing elements

Chaining the filters with | performs the addition and the removal in a single command.

$ cat test1.jsonl | jq -c '. + {"jkl": .abc} | del(.ghi)'

{"abc":"abc","def":"def","jkl":"abc"}
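
When jq is not available, the same transformation can be sketched with Python's json module. The function name add_and_delete is made up for this example, and it handles one JSONL line at a time.

```python
import json


def add_and_delete(line):
    """Copy "abc" into "jkl" and drop "ghi", like the jq filter above."""
    obj = json.loads(line)
    obj['jkl'] = obj['abc']
    obj.pop('ghi', None)  # no error if "ghi" is absent
    # separators=(',', ':') gives compact output similar to jq -c.
    return json.dumps(obj, separators=(',', ':'), ensure_ascii=False)


line = '{"abc": "abc", "def": "def", "ghi": "ghi"}'
print(add_and_delete(line))
```

To process a whole file, the function can be applied to each line read from sys.stdin.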

Referencing the class name and member variables in Python

2024-10-06 14:44:09 | python

Notes on referencing an instance's class name and member variables in Python.

An instance's class name is available as __class__.__name__, and its member variables as __dict__.

class Class1:
    def __init__(self, num, str):
        self.num = num
        self.str = str


# Pass the arguments in the order (num, str) declared by __init__.
i1 = Class1(123, 'abc')
i2 = Class1(456, 'def')

# Class name
print(i1.__class__.__name__)
print(i2.__class__.__name__)

# Values of the member variables
print(i1.__dict__)
print(i2.__dict__)

■ Result

Class1
Class1
{'num': 123, 'str': 'abc'}
{'num': 456, 'str': 'def'}
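
For ordinary classes, the built-ins type(obj) and vars(obj) give the same information as obj.__class__ and obj.__dict__. As a small usage sketch, they can be combined into a generic debug formatter; the Point class and the describe function are hypothetical names used only for illustration.

```python
class Point:  # hypothetical class, for illustration only
    def __init__(self, x, y):
        self.x = x
        self.y = y


def describe(obj):
    """Format an instance as 'ClassName(attr=value, ...)'."""
    # vars(obj) returns obj.__dict__; type(obj).__name__ is the class name.
    fields = ', '.join(f'{k}={v!r}' for k, v in vars(obj).items())
    return f'{type(obj).__name__}({fields})'


print(describe(Point(1, 2)))
```

This only works for instances that have a __dict__; classes that define __slots__ would need a different approach.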

Flushing an output stream in Python

2024-10-01 23:39:31 | python

Notes on flushing an output stream in Python.

Pass flush=True to print()

print('Hello World!', flush=True)

Call sys.stdout.flush()

import sys

print('Hello World!')
sys.stdout.flush()
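
To observe what flush=True actually does, print() can be pointed at a small file-like object that counts flush() calls. RecordingStream is a hypothetical class written only for this demonstration.

```python
class RecordingStream:
    """A minimal file-like object that counts flush() calls (illustrative only)."""

    def __init__(self):
        self.chunks = []
        self.flush_count = 0

    def write(self, text):
        self.chunks.append(text)
        return len(text)

    def flush(self):
        self.flush_count += 1


stream = RecordingStream()
print('no flush', file=stream)                 # write() only
print('with flush', file=stream, flush=True)   # write(), then stream.flush()
print(stream.flush_count)
```

Only the second print() call triggers flush() on the stream, which is the behavior the two approaches above rely on.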

Uploading base64-encoded images to Cloud Storage

2024-09-30 23:00:03 | python

Notes on uploading base64-encoded images to Cloud Storage.

Program

If content_type is not specified when uploading, the data is treated as text data.

# usage: upload_base64_image_to_gcs.py {bucket} {dir} < {json}

import sys
import base64
import json
from google.cloud import storage

def upload_to_gcs(bucket, dir, obj):
    img_bytes = base64.b64decode(obj['image_base64'])
    blob = bucket.blob(f"{dir}/{obj['file_name']}")
    blob.upload_from_string(img_bytes, content_type='image/jpeg')

def main():
    gcs_bucket = sys.argv[1]
    gcs_dir = sys.argv[2]

    client = storage.Client()
    bucket = client.bucket(gcs_bucket)

    for line in sys.stdin:
        obj = json.loads(line)
        upload_to_gcs(bucket, gcs_dir, obj)

    return 0

if __name__ == '__main__':
    res = main()
    sys.exit(res)

Data

$ cat data.jsonl
{"id":"01","file_name":"01.jpg","image_base64":"/9j/4AAQSkZJ..."}
{"id":"02","file_name":"02.jpg","image_base64":"/9j/4AAQSkZJ..."}

Running the program and checking the result

$ cat data.jsonl | python upload_base64_image_to_gcs.py gcs-bucket-001 images

$ gsutil ls -L gs://gcs-bucket-001/images/*.jpg
gs://gcs-bucket-001/images/01.jpg:
    ...
    Storage class:          STANDARD
    Content-Length:         28300
    Content-Type:           image/jpeg
    ...
gs://gcs-bucket-001/images/02.jpg:
    ...
    Storage class:          STANDARD
    Content-Length:         21648
    Content-Type:           image/jpeg
    ...
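
The upload script hard-codes content_type='image/jpeg'. If the input might also contain other image formats, one option is to guess the type from the file name with the standard mimetypes module. The helper name guess_content_type and the fallback value are assumptions of this sketch.

```python
import mimetypes


def guess_content_type(file_name, default='application/octet-stream'):
    """Guess a Content-Type from the file extension, falling back to `default`."""
    content_type, _encoding = mimetypes.guess_type(file_name)
    return content_type or default


print(guess_content_type('01.jpg'))
print(guess_content_type('02.png'))
print(guess_content_type('no_extension'))
```

The returned value could then be passed as the content_type argument of blob.upload_from_string() in place of the fixed string.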

Base64-encoding images in Python

2024-09-29 12:21:34 | python

Notes on base64-encoding images in Python.

Program

import sys
import os
import base64
import json


def read_image_file(image_file):
    # Read the whole file as raw bytes.
    with open(image_file, 'rb') as inst:
        return inst.read()


def main():
    for i in range(1, len(sys.argv)):
        image_file = sys.argv[i]
        image_bytes = read_image_file(image_file)
        image_base64 = base64.b64encode(image_bytes).decode('utf-8')
        obj = {
            'id': i,
            # Emit the file name too, as in the result shown for this program.
            'file_name': os.path.basename(image_file),
            'image_base64': image_base64,
        }
        print(json.dumps(obj, ensure_ascii=False))

    return 0

if __name__ == '__main__':
    res = main()
    sys.exit(res)

Result

$ python image_base64.py img1.jpg img2.jpg

{"id": 1, "file_name": "img1.jpg", "image_base64": "/9j/4AA..."}
{"id": 2, "file_name": "img2.jpg", "image_base64": "/9j/4AA..."}
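
Both image_base64 values start with /9j/ because JPEG files begin with the bytes FF D8 FF, and those three bytes encode to exactly those four base64 characters. The short sketch below checks this and also decodes the value back, as a consumer of the JSONL output would; the variable names are illustrative.

```python
import base64
import json

# The first three bytes of a JPEG file (the SOI marker FF D8, then FF).
jpeg_head = b'\xff\xd8\xff'

encoded = base64.b64encode(jpeg_head).decode('utf-8')
print(encoded)

# Round trip through JSON, as a reader of the JSONL output would do.
line = json.dumps({'id': 1, 'image_base64': encoded})
decoded = base64.b64decode(json.loads(line)['image_base64'])
print(decoded == jpeg_head)
```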

Getting the date and datetime in Japan time in BigQuery

2024-09-29 11:32:25 | BigQuery

Notes on getting the date and datetime in Japan time in BigQuery.

The date and datetime in Japan time can be obtained by passing 'Asia/Tokyo' as the argument to current_date() and current_datetime().

SQL

select
  current_date() as utc_date
  , current_date('Asia/Tokyo') as jpn_date
  , current_datetime() as utc_datetime
  , current_datetime('Asia/Tokyo') as jpn_datetime
;

Result

[{
  "utc_date": "2024-09-29",
  "jpn_date": "2024-09-29",
  "utc_datetime": "2024-09-29T02:26:20.153788",
  "jpn_datetime": "2024-09-29T11:26:20.153788"
}]
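
The nine-hour difference between the two datetime columns can be reproduced in Python with the standard datetime module. This sketch uses a fixed UTC+9 offset instead of the 'Asia/Tokyo' zone name, which is a simplification that works because Japan time has no daylight saving time; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# JST is UTC+9 all year round, so a fixed offset is sufficient here.
JST = timezone(timedelta(hours=9), 'JST')

# The UTC timestamp from the result above.
utc_now = datetime(2024, 9, 29, 2, 26, 20, 153788, tzinfo=timezone.utc)
jst_now = utc_now.astimezone(JST)

print(jst_now.date().isoformat())
print(jst_now.replace(tzinfo=None).isoformat())
```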


