RDF Dump - Extracting the Japanese Data | dmoz Editor Diary

The documentation for the RDF dump is here:
・ Open Directory RDF Dump (English) (Excite translation, Google translation).
A sample of the dump file looks like this:
・ http://rdf.dmoz.org/rdf/content.example.txt

The dump is a huge file that contains the data for every language hierarchy, so for now I run a simple perl script like the one below to pull out the Japanese data.


---
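#!/usr/bin/perl
# Copy only the Top/World/Japanese part of the full dump to a new file.
use strict;
use warnings;

open my $in,  '<', 'content.rdf.u8'    or die "Error1($!)\n";
open my $out, '>', 'content.ja.rdf.u8' or die "Error2($!)\n";
my $wjflag = 0;
while ( my $line = <$in> ) {
 if ( $line =~ /<Topic/ ) {
  # A <Topic> line starts a new category, so re-evaluate the flag here
  $wjflag = ( $line =~ m{r:id="Top/World/Japanese} ) ? 1 : 0;
 }
 if ( $wjflag ) { # write out only lines inside World/Japanese topics
  print $out $line;
 }
}
close $out;
close $in;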
---
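
To see what the flag does, here is a minimal sketch that applies the same test to a few in-memory lines instead of the real dump. The sample lines and the category name are made up for illustration (they only imitate the layout of content.example.txt), and the helper name filter_japanese is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that belong to World/Japanese topics.
# The flag changes only on <Topic ...> lines, so the <ExternalPage> records
# that follow a topic inherit that topic's setting.
sub filter_japanese {
    my @lines  = @_;
    my $wjflag = 0;
    my @kept;
    for my $line (@lines) {
        if ( $line =~ /<Topic/ ) {
            $wjflag = ( $line =~ m{r:id="Top/World/Japanese} ) ? 1 : 0;
        }
        push @kept, $line if $wjflag;
    }
    return @kept;
}

# Made-up sample in the layout of the dump (not real directory data).
my @sample = (
    qq{<Topic r:id="Top/Arts">\n},
    qq{</Topic>\n},
    qq{<ExternalPage about="http://example.com/">\n},
    qq{</ExternalPage>\n},
    qq{<Topic r:id="Top/World/Japanese/Example">\n},
    qq{</Topic>\n},
    qq{<ExternalPage about="http://example.jp/">\n},
    qq{</ExternalPage>\n},
);

# Prints only the last four sample lines.
print filter_japanese(@sample);
```

Because the extract starts at the first matching Topic line, the XML header and the closing lines of the dump are not copied. That does not matter for line-by-line processing, but the result is probably not something an XML parser would accept as is.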

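The data for each listed site is enclosed in an ExternalPage tag, with every element always on its own line, so a text file listing the sites can be produced like this.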


---
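#!/usr/bin/perl
# Write every site in the Japanese extract as one tab-separated row.
use strict;
use warnings;

open my $in,  '<', 'content.ja.rdf.u8' or die "Error1($!)\n";
open my $out, '>', 'content.txt'       or die "Error2($!)\n";
my ( $xtopic, $xurl, $xtitle, $xdesc ) = ( "", "", "", "" );
while ( my $line = <$in> ) {
 if ( $line =~ m{</ExternalPage} ) {  # end of the tag: write out one row
  print $out "$xtopic\t$xurl\t\t$xtitle\t$xdesc\n";
 } elsif ( $line =~ /<ExternalPage/ ) { # start of the tag: reset the variables and set the URL
  $xtopic = $xurl = $xtitle = $xdesc = "";
  $xurl = $1 if $line =~ /about="([^"]*)"/;
 } elsif ( $line =~ /<d:Title/ ) { # title
  $xtitle = $1 if $line =~ m{>(.*)</};
 } elsif ( $line =~ /<d:Desc/ ) { # description
  $xdesc = $1 if $line =~ m{>(.*)</};
 } elsif ( $line =~ /<topic/ ) { # category name
  $xtopic = $1 if $line =~ m{>(.*)</};
 }
}
close $out;
close $in;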
---
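
One optional refinement: since the dump is XML, characters such as & can be expected to appear in escaped form (&amp;) in the extracted titles, descriptions and URLs. The sketch below shows one way to turn them back into plain characters before writing the list. The helper name unescape_xml and the sample title are made up, and only the five entities predefined by XML are handled; numeric character references are left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The five entities predefined by XML.
my %entity = ( amp => '&', lt => '<', gt => '>', quot => '"', apos => "'" );

# Hypothetical helper: replace predefined XML entities in one extracted value.
# A single pass avoids decoding twice (e.g. "&amp;lt;" becomes "&lt;", not "<").
sub unescape_xml {
    my ($text) = @_;
    $text =~ s/&(amp|lt|gt|quot|apos);/$entity{$1}/g;
    return $text;
}

# Made-up title, as it might appear between <d:Title> tags.
print unescape_xml('Tom &amp; Jerry &lt;fan site&gt;'), "\n";
```

In the listing script this would be applied to each captured value, for example $xtitle = unescape_xml($1).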

Processing takes a while, but this is roughly how the list of directory entries can be pulled out.

Note: the data files are encoded in UTF-8.
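
Perl treats each line as raw bytes unless told otherwise. That is enough for copying lines through unchanged, but for character-level work such as counting or truncating Japanese text, one option is to open the file with an encoding layer. A minimal sketch follows; the helper name read_utf8_lines and the temporary demo file are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 file and return its lines as decoded
# character strings, with line endings removed.
sub read_utf8_lines {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path ($!)\n";
    chomp( my @lines = <$in> );
    close $in;
    return @lines;
}

# Write a small made-up file (three Japanese characters), then read it back.
my ( $fh, $demo ) = tempfile( UNLINK => 1 );
binmode $fh, ':encoding(UTF-8)';
print $fh "\x{65E5}\x{672C}\x{8A9E}\n";    # 日本語
close $fh;

my ($first) = read_utf8_lines($demo);
print length($first), "\n";    # length in characters, not bytes
```

When decoded strings are written out again, the output handle needs the same layer (for example '>:encoding(UTF-8)'), otherwise Perl warns about wide characters.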


dmoz Editor Diary - ODP (Open Directory Project) Japanese hierarchy

Activity reports and notes from volunteer editors of the ODP Japanese hierarchy.
