(2) 6 Reasons to train your Large Language Models (LLM) with structured content

2024.02.22

First, what does it mean to structure and enrich content? For many enterprises, most of their content assets are locked up in document formats such as .doc and .pdf. This means the document itself represents a combination of the content, design and output format. However, when you author content using a structured model like DITA, you develop granular, reusable content components that can be easily repurposed for different audiences, channels, form factors and publication types. As you create these content components, you can also enrich them at the component level by adding valuable metadata.

You may say to yourself, okay, I understand why structuring and enriching content adds value, but it seems like a lot of work, and once my enterprise has access to a private LLM, won’t the LLM just be able to work its magic on my unstructured content? Sort of, yes, but if you care about accuracy, you are going to be disappointed. You don’t want GIGO (garbage in, garbage out); you want Quality-In Quality-Out, and that’s where the investment in structuring and enriching pays off, particularly when it comes to the dataset your enterprise will use to fine-tune whichever foundation or base model it adopts.

Training approaches – fine-tuning versus hybrid search

At this point, it’s worth highlighting that when someone references “training an LLM on enterprise data”, they could be referring to either of two very different approaches.

The first approach is known as fine-tuning. This is the approach most people have in mind when they think about “training an LLM” on enterprise data or for a specific task or domain. In fine-tuning, the enterprise's own data is used to update the weights of the base model using a framework like PyTorch or TensorFlow. The challenge with this approach is that it requires a high degree of in-house expertise and can be expensive enough to make it appropriate for only the largest organizations.
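
As a rough illustration of what "updating the weights" means, the sketch below runs a few gradient-descent steps on a tiny linear model in plain Python. It is a toy under stated assumptions, not an LLM fine-tuning recipe: the `fine_tune` and `mse` helpers, the starting weights, the learning rate and the data points are all made up for illustration, and a real project would rely on a framework such as PyTorch or TensorFlow.

```python
def fine_tune(weights, examples, learning_rate=0.05, epochs=200):
    """Return weights nudged toward `examples` by gradient descent.

    The model is y = w * x + b with a mean-squared-error loss.
    All names and values here are hypothetical, for illustration only.
    """
    w, b = weights
    n = len(examples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
        # Move each weight a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(weights, examples):
    """Mean squared error of the linear model on the examples."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


# "Base model" weights (hypothetical) and a tiny "enterprise" dataset (made up).
base_weights = (1.0, 0.0)
enterprise_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # follows y = 2x + 1

tuned_weights = fine_tune(base_weights, enterprise_data)
print("loss before:", mse(base_weights, enterprise_data))
print("loss after: ", mse(tuned_weights, enterprise_data))
```

The point of the toy is only that the starting weights are kept and adjusted rather than learned from scratch; in an actual LLM the same idea is applied to a very large number of weights, which is where the cost and expertise requirements come from.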

The second approach does not actually involve training a base model at all and is more accurately called Hybrid Search. The approach is similar to how Bing Search has integrated traditional search with LLMs. In the hybrid search model, an enterprise search engine is used to first find relevant enterprise data. The search engine uses some form of cognitive search to identify documents (or components) that contain the user's query terms. This enterprise information is then turned into a prompt that is passed to the LLM, which then generates a conversational response to the user's query.

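
The hybrid search flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `COMPONENTS` store, the `search_components` and `build_prompt` helpers, and the simple term-overlap scoring are hypothetical stand-ins for a real enterprise search engine, and the final call to an LLM is left out because it depends on the provider.

```python
import re

# Hypothetical store of vetted content components (id -> text).
COMPONENTS = {
    "reset-password": "To reset your password, open Settings and choose Reset Password.",
    "install-app": "Download the installer and run it to install the app.",
    "billing-cycle": "Invoices are issued on the first day of each billing cycle.",
}


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_components(query, components, top_k=2):
    """Rank components by how many query terms they contain (a toy scorer)."""
    query_terms = tokenize(query)
    scored = []
    for component_id, text in components.items():
        overlap = len(query_terms & tokenize(text))
        if overlap:
            scored.append((overlap, component_id))
    # Highest overlap first; ties broken by component id for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [component_id for _, component_id in scored[:top_k]]


def build_prompt(query, component_ids, components):
    """Turn the retrieved components into a prompt for the LLM."""
    context = "\n".join(f"[{cid}] {components[cid]}" for cid in component_ids)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


query = "How do I reset my password?"
hits = search_components(query, COMPONENTS)
prompt = build_prompt(query, hits, COMPONENTS)
print(prompt)  # This string would then be passed to the LLM.
```

Note that the base model is never modified here: only the prompt changes from query to query, which is why this approach tends to be simpler to operate than fine-tuning.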

The speed of innovation in this area makes predictions difficult, but the hybrid search model may end up being more commonly deployed than the fine-tuning approach because it’s simpler and more cost effective for many enterprises.

6 Reasons to use structured content

To return to the topic of this post, the key takeaway is that investing in properly structuring and enriching enterprise data results in substantially improved response accuracy, regardless of whether you pursue a fine-tuning or hybrid search approach.

Here are some ways that having your content in a format like DITA and enriching it with metadata connected to a taxonomy can result in better fine-tuning or hybrid search results and improved LLM performance:

1. Granular Information Extraction: DITA divides content into reusable and modular components. Tagging and including metadata at the component level during training enables LLMs to learn and understand these granular units of information. By fine-tuning on structured data, the models develop the ability to extract and process specific components more accurately, resulting in better comprehension and accuracy. This concept also applies to the hybrid search scenario, where the enterprise search engine will be able to incorporate vetted component-level content in the prompt passed to the LLM.

2. Structure and Context: Markup languages like DITA provide a structured way of representing information. By including markup in the training data, the language model can leverage the hierarchical relationships, semantic meaning, and contextual information encoded in the markup. This enables the model to better understand and generate text that adheres to the intended structure and formatting.
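
To make these two points concrete, the sketch below parses a small DITA-style topic with Python's standard `xml.etree.ElementTree` module and emits one record per text-bearing element, pairing the text with topic-level metadata and with the element's position in the hierarchy. The sample topic, the `audience` and `product` metadata values, and the `extract_records` helper are assumptions made for illustration, and the record layout is just one possible shape for training data or search-index entries.

```python
import xml.etree.ElementTree as ET

# A made-up DITA-style task topic used only for this example.
SAMPLE_TOPIC = """<task id="reset-password">
  <title>Reset your password</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <othermeta name="product" content="ExampleApp"/>
    </metadata>
  </prolog>
  <taskbody>
    <steps>
      <step><cmd>Open Settings.</cmd></step>
      <step><cmd>Choose Reset Password.</cmd></step>
    </steps>
  </taskbody>
</task>"""


def extract_records(topic_xml):
    """Return one dict per text-bearing element, enriched with topic metadata."""
    root = ET.fromstring(topic_xml)

    # Collect topic-level metadata from the prolog.
    metadata = {"topic_id": root.get("id"), "topic_type": root.tag}
    audience = root.find("./prolog/metadata/audience")
    if audience is not None:
        metadata["audience"] = audience.get("type")
    for other in root.findall("./prolog/metadata/othermeta"):
        metadata[other.get("name")] = other.get("content")

    records = []

    def walk(element, path):
        # Track the element's place in the hierarchy, e.g. task/taskbody/steps.
        current_path = path + [element.tag]
        text = (element.text or "").strip()
        if text:
            records.append({"path": "/".join(current_path), "text": text, **metadata})
        for child in element:
            if child.tag != "prolog":  # Metadata was already captured above.
                walk(child, current_path)

    walk(root, [])
    return records


for record in extract_records(SAMPLE_TOPIC):
    print(record)
```

Each record carries both the component-level metadata and the structural path, so the same output could feed a fine-tuning dataset or the index behind a hybrid search engine.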
