> ## Documentation Index
> Fetch the complete documentation index at: https://wb-21fd5541-docs-sandboxes-integrations-placement.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

# RAG 애플리케이션 평가하기

> Weave와 LLM 기반 평가자를 사용해 RAG 애플리케이션을 구축하고 평가합니다

export const GitHubLink = ({url}) => <a href={url} target="_blank" rel="noopener noreferrer" className="github-source-link">
    <svg width="20" height="20" viewBox="0 0 24 24" fill="currentColor" xmlns="http://www.w3.org/2000/svg">
      <path d="M12 0C5.37 0 0 5.37 0 12c0 5.31 3.435 9.795 8.205 11.385.6.105.825-.255.825-.57 0-.285-.015-1.23-.015-2.235-3.015.555-3.795-.735-4.035-1.41-.135-.345-.72-1.41-1.23-1.695-.42-.225-1.02-.78-.015-.795.945-.015 1.62.87 1.845 1.23 1.08 1.815 2.805 1.305 3.495.99.105-.78.42-1.305.765-1.605-2.67-.3-5.46-1.335-5.46-5.925 0-1.305.465-2.385 1.23-3.225-.12-.3-.54-1.53.12-3.18 0 0 1.005-.315 3.3 1.23.96-.27 1.98-.405 3-.405s2.04.135 3 .405c2.295-1.56 3.3-1.23 3.3-1.23.66 1.65.24 2.88.12 3.18.765.84 1.23 1.905 1.23 3.225 0 4.605-2.805 5.625-5.475 5.925.435.375.81 1.095.81 2.22 0 1.605-.015 2.895-.015 3.3 0 .315.225.69.825.57A12.02 12.02 0 0024 12c0-6.63-5.37-12-12-12z" />
    </svg>
    GitHub 소스 코드
  </a>;

export const ColabLink = ({url}) => <a href={url} target="_blank" rel="noopener noreferrer" className="colab-link">
    <svg width="20" height="20" viewBox="0 0 24 24" fill="currentColor" xmlns="http://www.w3.org/2000/svg">
      <path d="M14.25.18l.9.2.73.26.59.3.45.32.34.34.25.34.16.33.1.3.04.26.02.2-.01.13V8.5l-.05.63-.13.55-.21.46-.26.38-.3.31-.33.25-.35.19-.35.14-.33.1-.3.07-.26.04-.21.02H8.77l-.69.05-.59.14-.5.22-.41.27-.33.32-.27.35-.2.36-.15.37-.1.35-.07.32-.04.27-.02.21v3.06H3.17l-.21-.03-.28-.07-.32-.12-.35-.18-.36-.26-.36-.36-.35-.46-.32-.59-.28-.73-.21-.88-.14-1.05-.05-1.23.06-1.22.16-1.04.24-.87.32-.71.36-.57.4-.44.42-.33.42-.24.4-.16.36-.1.32-.05.24-.01h.16l.06.01h8.16v-.83H6.18l-.01-2.75-.02-.37.05-.34.11-.31.17-.28.25-.26.31-.23.38-.2.44-.18.51-.15.58-.12.64-.1.71-.06.77-.04.84-.02 1.27.05zm-6.3 1.98l-.23.33-.08.41.08.41.23.34.33.22.41.09.41-.09.33-.22.23-.34.08-.41-.08-.41-.23-.33-.33-.22-.41-.09-.41.09zm13.09 3.95l.28.06.32.12.35.18.36.27.36.35.35.47.32.59.28.73.21.88.14 1.04.05 1.23-.06 1.23-.16 1.04-.24.86-.32.71-.36.57-.4.45-.42.33-.42.24-.4.16-.36.09-.32.05-.24.02-.16-.01h-8.22v.82h5.84l.01 2.76.02.36-.05.34-.11.31-.17.29-.25.25-.31.24-.38.2-.44.17-.51.15-.58.13-.64.09-.71.07-.77.04-.84.01-1.27-.04-1.07-.14-.9-.2-.73-.25-.59-.3-.45-.33-.34-.34-.25-.34-.16-.33-.1-.3-.04-.25-.02-.2.01-.13v-5.34l.05-.64.13-.54.21-.46.26-.38.3-.32.33-.24.35-.2.35-.14.33-.1.3-.06.26-.04.21-.02.13-.01h5.84l.69-.05.59-.14.5-.21.41-.28.33-.32.27-.35.2-.36.15-.36.1-.35.07-.32.04-.28.02-.21V6.07h2.09l.14.01.21.03zm-6.47 14.25l-.23.33-.08.41.08.41.23.33.33.23.41.08.41-.08.33-.23.23-.33.08-.41-.08-.41-.23-.33-.33-.23-.41-.08-.41.08z" />
    </svg>
    Colab에서 사용해 보기
  </a>;

<div style={{ display: 'flex', gap: '12px', flexWrap: 'wrap' }}>
  <ColabLink url="https://colab.research.google.com/github/wandb/docs/blob/main/weave/cookbooks/source/evaluate_rag_applications.ipynb" />

  <GitHubLink url="https://github.com/wandb/docs/blob/main/weave/cookbooks/source/evaluate_rag_applications.ipynb" />
</div>

검색 증강 생성(RAG)은 맞춤형 지식 베이스를 활용할 수 있는 생성형 AI 애플리케이션을 구축하는 일반적인 방법입니다. 이 튜토리얼에서는 RAG 애플리케이션을 구축하고, Weave를 사용해 검색 step을 추적하고, LLM 기반 평가자로 응답을 평가하는 방법을 안내합니다. 이를 통해 애플리케이션이 반환하는 답변의 품질을 측정하고 개선할 수 있습니다.

이 가이드는 파이프라인에 관측성과 체계적인 평가를 추가하려는 RAG 애플리케이션 개발자를 위한 것입니다.

<img src="https://mintcdn.com/wb-21fd5541-docs-sandboxes-integrations-placement/o_NUj-zO1if2NqBx/images/evals-hero.png?fit=max&auto=format&n=o_NUj-zO1if2NqBx&q=85&s=281b0030600fbd0124ebe7ff61d4c523" alt="Evals 대표 이미지" width="4100" height="2160" data-path="images/evals-hero.png" />

<div id="what-youll-learn">
  ## 학습할 내용
</div>

이 가이드에서는 다음을 설명합니다:

* 지식 베이스 구축.
* 관련 문서를 찾는 검색 step이 포함된 RAG 애플리케이션 만들기.
* Weave로 검색 step 추적.
* 컨텍스트 정밀도를 측정하기 위해 LLM 기반 평가자를 사용해 RAG 애플리케이션 평가.
* 맞춤형 점수 함수 정의.

<div id="prerequisites">
  ## 사전 요구 사항
</div>

* [W\&B 계정](https://wandb.ai/signup)
* Python 3.10+ 또는 Node.js 18+
* 필수 패키지가 설치되어 있어야 합니다:
  * **Python**: `pip install weave openai`
  * **TypeScript**: `npm install weave openai`
* [OpenAI API 키](https://platform.openai.com/api-keys)를 환경 변수로 설정해야 합니다.

<div id="build-a-knowledge-base">
  ## 지식 베이스 구축하기
</div>

지식 베이스는 RAG 애플리케이션이 쿼리 시점에 검색하는 코퍼스입니다. 이 섹션에서는 소규모 아티클 집합에 대한 벡터 임베딩을 계산하여, 나중에 주어진 질문에 가장 관련성이 높은 아티클을 가져올 수 있도록 합니다.

먼저 아티클의 임베딩을 계산합니다. 일반적으로는 아티클별로 이 작업을 한 번만 수행한 뒤 임베딩과 메타데이터를 데이터베이스에 저장하지만, 여기서는 단순화를 위해 스크립트를 실행할 때마다 수행합니다.

<Tabs>
  <Tab title="Python">
    ```python lines theme={null}
    from openai import OpenAI
    import weave
    from weave import Model
    import numpy as np
    import json
    import asyncio

    articles = [
        "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
        "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
        "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
        "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
        "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
        "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
        "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the moon's surface while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
    ]

    def docs_to_embeddings(docs: list) -> list:
        openai = OpenAI()
        document_embeddings = []
        for doc in docs:
            response = (
                openai.embeddings.create(input=doc, model="text-embedding-3-small")
                .data[0]
                .embedding
            )
            document_embeddings.append(response)
        return document_embeddings

    article_embeddings = docs_to_embeddings(articles) # 참고: 일반적으로는 아티클별로 이 작업을 한 번만 수행한 뒤 임베딩과 메타데이터를 데이터베이스에 저장합니다
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript twoslash lines theme={null}
    // @noErrors
    require('dotenv').config();
    import { OpenAI } from 'openai';
    import * as weave from 'weave';

    interface Article {
        text: string;
        embedding?: number[];
    }

    const articles: Article[] = [
        { 
            text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
        },
        { 
            text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
        },
        { 
            text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
        }
    ];

    function cosineSimilarity(a: number[], b: number[]): number {
        const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
        const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
        const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
        return dotProduct / (magnitudeA * magnitudeB);
    }

    const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
        const openai = new OpenAI();
        const enrichedDocs = await Promise.all(docs.map(async (doc) => {
            const response = await openai.embeddings.create({
                input: doc.text,
                model: "text-embedding-3-small"
            });
            return {
                ...doc,
                embedding: response.data[0].embedding
            };
        }));
        return enrichedDocs;
    });
    ```
  </Tab>
</Tabs>

<div id="create-a-rag-app">
  ## RAG 앱 만들기
</div>

지식 베이스가 준비되었으므로 이제 RAG 애플리케이션 자체를 구축할 수 있습니다. 이 섹션에서는 검색 step와 LLM Call을 결합하고, 둘 다 Weave로 래핑하여 모든 입력과 출력이 자동으로 추적되도록 합니다.

다음으로, 검색 함수 `get_most_relevant_document`를 `weave.op()` 데코레이터로 래핑하고 `Model` 클래스를 만듭니다. 검색 함수를 `weave.op()`로 래핑하면 Weave가 모든 call에 대한 inputs와 outputs를 캡처할 수 있으며, 이로써 나중에 검색 step을 검사할 수 있게 됩니다. `weave.init('<team-name>/rag-quickstart')`를 호출해 나중에 확인할 수 있도록 함수의 모든 inputs와 outputs 추적을 시작합니다. 팀 이름을 지정하지 않으면 출력은 [W\&B 기본 team 또는 entity](/ko/platform/app/settings-page/user-settings#default-team)에 기록됩니다.

<Tabs>
  <Tab title="Python">
    ```python lines theme={null}
    from openai import OpenAI
    import weave
    from weave import Model
    import numpy as np
    import asyncio

    @weave.op()
    def get_most_relevant_document(query):
        openai = OpenAI()
        query_embedding = (
            openai.embeddings.create(input=query, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        similarities = [
            np.dot(query_embedding, doc_emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in article_embeddings
        ]
        # 가장 유사한 문서의 인덱스를 조회합니다
        most_relevant_doc_index = np.argmax(similarities)
        return articles[most_relevant_doc_index]

    class RAGModel(Model):
        system_message: str
        model_name: str = "gpt-3.5-turbo-1106"

        @weave.op()
        def predict(self, question: str) -> dict: # 참고: `question`은 나중에 evaluation 행에서 데이터를 선택하는 데 사용됩니다
            from openai import OpenAI
            context = get_most_relevant_document(question)
            client = OpenAI()
            query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
            Context:
            \"\"\"
            {context}
            \"\"\"
            Question: {question}"""
            response = client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": self.system_message},
                    {"role": "user", "content": query},
                ],
                temperature=0.0,
                response_format={"type": "text"},
            )
            answer = response.choices[0].message.content
            return {'answer': answer, 'context': context}

    # 팀 및 프로젝트 이름을 설정합니다
    weave.init('<team-name>/rag-quickstart')
    model = RAGModel(
        system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
    )
    model.predict("What significant result was reported about Zealand Pharma's obesity trial?")
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript twoslash lines theme={null}
    // @noErrors
    class RAGModel {
        private openai: OpenAI;
        private systemMessage: string;
        private modelName: string;
        private articleEmbeddings: Article[];

        constructor(config: {
            systemMessage: string;
            modelName?: string;
            articleEmbeddings: Article[];
        }) {
            this.openai = new OpenAI();
            this.systemMessage = config.systemMessage;
            this.modelName = config.modelName || "gpt-3.5-turbo-1106";
            this.articleEmbeddings = config.articleEmbeddings;
            this.predict = weave.op(this, this.predict);
        }

        async predict(question: string): Promise<{
            answer: string;
            context: string;
        }> {
            const context = await this.getMostRelevantDocument(question);

            const response = await this.openai.chat.completions.create({
                model: this.modelName,
                messages: [
                    { role: "system", content: this.systemMessage },
                    { role: "user", content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                        Context:
                        """
                        ${context}
                        """
                        Question: ${question}` }
                ],
                temperature: 0
            });

            return {
                answer: response.choices[0].message.content || "",
                context
            };
        }
    }
    ```
  </Tab>
</Tabs>

<div id="evaluate-with-an-llm-judge">
  ## LLM 기반 평가자로 평가하기
</div>

이제 RAG 애플리케이션이 실행되고 Weave로 추적되고 있으므로, 다음 단계는 질문에 얼마나 잘 답하는지 평가하는 것입니다. 이 섹션에서는 LLM을 자동 평가자로 사용하는 방법을 보여주므로, 레이블을 수동으로 작성하지 않고도 애플리케이션의 응답에 점수를 매길 수 있습니다.

애플리케이션을 평가할 단순한 방법이 없을 때는, LLM을 사용해 애플리케이션의 일부 측면을 평가하는 것도 한 가지 방법입니다. 다음은 주어진 답변에 도달하는 데 컨텍스트가 유용했는지 확인하도록 프롬프트해, 컨텍스트 정밀도를 측정하는 LLM 기반 평가자 사용 예시입니다. 이 프롬프트는 널리 사용되는 [RAGAS 프레임워크](https://docs.ragas.io/en/stable/)를 바탕으로 보강되었습니다.

<div id="define-a-scoring-function">
  ### 점수 함수 정의하기
</div>

[Build an Evaluation pipeline tutorial](/ko/weave/tutorial-eval)과 마찬가지로, 앱을 테스트할 예시 행 집합과 점수 함수를 정의합니다. 점수 함수는 각 행을 하나씩 받아 평가합니다. 입력 매개변수는 해당 행의 키와 일치해야 하므로, 여기서 `question`은 행 딕셔너리에서 가져옵니다. `output`은 모델의 출력입니다. 모델 입력도 입력 인수에 따라 예시에서 가져오므로, 여기서 역시 `question`이 사용됩니다. 이 예제는 `async` 함수를 사용하므로 병렬로 실행됩니다. async에 대한 간단한 소개는 [Python `asyncio` 문서](https://docs.python.org/3/library/asyncio.html)를 참고하세요.

<Tabs>
  <Tab title="Python">
    ```python lines theme={null}
    from openai import OpenAI
    import weave
    import asyncio

    @weave.op()
    async def context_precision_score(question, output):
        context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
        Output in only valid JSON format.

        question: {question}
        context: {context}
        answer: {answer}
        verdict: """
        client = OpenAI()

        prompt = context_precision_prompt.format(
            question=question,
            context=output['context'],
            answer=output['answer'],
        )

        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
            response_format={ "type": "json_object" }
        )
        response_message = response.choices[0].message
        response = json.loads(response_message.content)
        return {
            "verdict": int(response["verdict"]) == 1,
        }

    questions = [
        {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
        {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
        {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
        {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
        {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
        {"question": "Which company achieved the first U.S. moon landing since 1972?"},
        {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
    ]
    evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
    asyncio.run(evaluation.evaluate(model)) # 참고: 평가할 모델을 정의해야 합니다
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript twoslash lines theme={null}
    // @noErrors
    const contextPrecisionScore = weave.op(async function(args: {
        datasetRow: QuestionRow;
        modelOutput: { answer: string; context: string; }
    }): Promise<ScorerResult> {
        const openai = new OpenAI();
        
        const prompt = `Given question, answer and context verify if the context was useful...`;

        const response = await openai.chat.completions.create({
            model: "gpt-4-turbo-preview",
            messages: [{ role: "user", content: prompt }],
            response_format: { type: "json_object" }
        });

        const result = JSON.parse(response.choices[0].message.content || "{}");
        return {
            verdict: parseInt(result.verdict) === 1
        };
    });

    const evaluation = new weave.Evaluation({
        dataset: createQuestionDataset(),
        scorers: [contextPrecisionScore]
    });

    await evaluation.evaluate({
        model: weave.op((args: { datasetRow: QuestionRow }) => 
            model.predict(args.datasetRow.question)
        )
    });
    ```
  </Tab>
</Tabs>

<div id="optional-define-a-scorer-class">
  ### 선택 사항: `Scorer` 클래스 정의하기
</div>

이전 섹션의 점수 함수는 단순한 경우에는 잘 작동하지만, 여러 평가에서 동일한 judge를 재사용하거나 점수 집계 방식을 사용자 지정하려는 경우에는 `Scorer` 클래스가 유용합니다. 다음 단계에서는 이를 언제, 어떻게 정의하는지 보여줍니다.

일부 애플리케이션에서는 맞춤형 평가 클래스를 만들고 싶을 수 있습니다. 예를 들어 특정 매개변수(예: chat model 및 prompt), 각 행에 대한 특정 채점 방식, 집계 점수의 특정 계산 방식을 갖는 표준화된 `LLMJudge` 클래스를 만들어야 할 수 있습니다. Weave는 바로 사용할 수 있는 `Scorer` 클래스 목록을 제공하며, 맞춤형 `Scorer`를 쉽게 만들 수 있도록 지원합니다. 다음 예제는 맞춤형 `class CorrectnessLLMJudge(Scorer)`를 만드는 방법을 보여줍니다.

개략적으로 보면 맞춤형 Scorer를 만드는 단계는 다음과 같습니다:

1. `weave.flow.scorer.Scorer`를 상속하는 맞춤형 클래스를 정의합니다.
2. `score` 함수를 재정의하고, 함수의 각 call을 추적하려면 `@weave.op()`을 추가합니다.
   * 이 함수는 모델의 예측 결과가 전달될 `output` 인수를 정의해야 합니다. 모델이 "None"을 반환하는 경우를 대비해 유형을 `Optional[dict]`로 정의하세요.
   * 나머지 인수는 일반 `Any` 또는 `dict`일 수도 있고, `weave.Evaluate` 클래스를 사용해 모델을 평가하는 데 사용되는 데이터셋에서 특정 column을 선택할 수도 있습니다. 이 인수들의 이름은 column 이름과 정확히 같아야 하며, `preprocess_model_input`을 사용하는 경우에는 그 처리를 거친 뒤 단일 행의 키 이름과 정확히 일치해야 합니다.
3. 선택 사항: 집계 점수 계산을 사용자 지정하려면 `summarize` 함수를 재정의합니다. 기본적으로 맞춤형 함수를 정의하지 않으면 Weave는 `weave.flow.scorer.auto_summarize` 함수를 사용합니다.
   * 이 함수에는 `@weave.op()` decorator가 있어야 합니다.

<Tabs>
  <Tab title="Python">
    ```python lines theme={null}
    from weave import Scorer

    class CorrectnessLLMJudge(Scorer):
        prompt: str
        model_name: str
        device: str

        @weave.op()
        async def score(self, output: Optional[dict], query: str, answer: str) -> Any:
            """pred, query, target를 비교하여 예측의 정확성을 점수화합니다.
            Args:
                - output: 평가 대상 모델이 제공하는 dict
                - query: 데이터셋에 정의된 질문
                - answer: 데이터셋에 정의된 정답
            Returns:
                - 단일 dict {metric name: 단일 평가 값}"""

            # get_model은 제공된 매개변수(OpenAI,HF...)를 기반으로 하는 일반적인 모델 getter로 정의됩니다
            eval_model = get_model(
                model_name = self.model_name,
                prompt = self.prompt
                device = self.device,
            )
            # 평가 속도를 높이기 위한 비동기 평가입니다. 반드시 async일 필요는 없습니다
            grade = await eval_model.async_predict(
                {
                    "query": query,
                    "answer": answer,
                    "result": output.get("result"),
                }
            )
            # 출력 파싱 - pydantic을 사용하면 더 견고하게 처리할 수 있습니다
            evaluation = "incorrect" not in grade["text"].strip().lower()

            # Weave에 표시되는 column 이름
            return {"correct": evaluation}

        @weave.op()
        def summarize(self, score_rows: list) -> Optional[dict]:
            """점수화 함수가 각 행에 대해 계산한 모든 점수를 집계합니다.
            Args:
                - score_rows: dict 목록입니다. 각 dict에는 메트릭과 점수가 있습니다
            Returns:
                - 입력과 동일한 구조의 중첩 dict"""

            # 아무것도 제공되지 않으면 weave.flow.scorer.auto_summarize 함수가 사용됩니다
            # return auto_summarize(score_rows)

            valid_data = [x.get("correct") for x in score_rows if x.get("correct") is not None]
            count_true = list(valid_data).count(True)
            int_data = [int(x) for x in valid_data]

            sample_mean = np.mean(int_data) if int_data else 0
            sample_variance = np.var(int_data) if int_data else 0
            sample_error = np.sqrt(sample_variance / len(int_data)) if int_data else 0

            # 추가 "correct" 계층은 필수는 아니지만 UI에 구조를 더해줍니다
            return {
                "correct": {
                    "true_count": count_true,
                    "true_fraction": sample_mean,
                    "stderr": sample_error,
                }
            }
    ```
  </Tab>

  <Tab title="TypeScript">
    ```plaintext theme={null}
    이 기능은 아직 TypeScript에서 사용할 수 없습니다.
    ```
  </Tab>
</Tabs>

이를 Scorer로 사용하려면 먼저 초기화한 후, 다음과 같이 `Evaluation`의 `scorers` 인수로 전달하세요:

<Tabs>
  <Tab title="Python">
    ```python lines theme={null}
    evaluation = weave.Evaluation(dataset=questions, scorers=[CorrectnessLLMJudge()])
    ```
  </Tab>

  <Tab title="TypeScript">
    ```plaintext theme={null}
    이 기능은 아직 TypeScript에서 사용할 수 없습니다.
    ```
  </Tab>
</Tabs>

<div id="put-it-all-together">
  ## 전체를 한데 모아보기
</div>

이 섹션에서는 앞선 단계의 모든 내용을 하나의 엔드투엔드 예시로 통합해, 각 요소가 어떻게 맞물리는지 보여주고 이를 자신의 RAG 애플리케이션에 맞게 조정할 수 있도록 합니다.

RAG 앱에서 동일한 결과를 얻으려면 다음을 수행하세요.

* LLM Call과 검색 단계 함수를 `weave.op()`으로 래핑하세요.
* 선택: `predict` 함수와 앱 세부 정보를 포함하는 `Model` 하위 클래스를 만드세요.
* 평가할 예시를 수집하세요.
* 하나의 예시에 점수를 매기는 점수 함수를 만드세요.
* `Evaluation` 클래스를 사용해 예시에 대해 평가를 실행하세요.

**참고:** 경우에 따라 Evaluation의 비동기 실행으로 인해 OpenAI, Anthropic 등의 모델 API에서 요청 속도 제한에 걸릴 수 있습니다. 이를 방지하려면 병렬 워커 수를 제한하는 환경 변수를 설정할 수 있습니다. 예: `WEAVE_PARALLELISM=3`.

다음은 전체 코드입니다.

<Tabs>
  <Tab title="Python">
    ```python lines {34,52-77} theme={null}
    from openai import OpenAI
    import weave
    from weave import Model
    import numpy as np
    import json
    import asyncio

    # 평가에 사용할 예시
    articles = [
        "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
        "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
        "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if it's stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
        "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
        "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
        "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
        "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the surface of the moon while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
    ]

    def docs_to_embeddings(docs: list) -> list:
        openai = OpenAI()
        document_embeddings = []
        for doc in docs:
            response = (
                openai.embeddings.create(input=doc, model="text-embedding-3-small")
                .data[0]
                .embedding
            )
            document_embeddings.append(response)
        return document_embeddings

    article_embeddings = docs_to_embeddings(articles) # Note: you would typically do this once with your articles and put the embeddings &amp; metadata in a database

    # 검색 단계에 decorator 추가
    @weave.op()
    def get_most_relevant_document(query):
        openai = OpenAI()
        query_embedding = (
            openai.embeddings.create(input=query, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        similarities = [
            np.dot(query_embedding, doc_emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in article_embeddings
        ]
        # 가장 유사한 문서의 인덱스 조회
        most_relevant_doc_index = np.argmax(similarities)
        return articles[most_relevant_doc_index]

    # 앱 세부 정보와 응답을 생성하는 predict 함수를 포함한 Model 서브클래스 생성
    class RAGModel(Model):
        system_message: str
        model_name: str = "gpt-3.5-turbo-1106"

        @weave.op()
        def predict(self, question: str) -> dict: # note: `question` will be used later to select data from our evaluation rows
            from openai import OpenAI
            context = get_most_relevant_document(question)
            client = OpenAI()
            query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
            Context:
            \"\"\"
            {context}
            \"\"\"
            Question: {question}"""
            response = client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": self.system_message},
                    {"role": "user", "content": query},
                ],
                temperature=0.0,
                response_format={"type": "text"},
            )
            answer = response.choices[0].message.content
            return {'answer': answer, 'context': context}

    # 팀 및 프로젝트 이름 설정
    weave.init('<team-name>/rag-quickstart')
    model = RAGModel(
        system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
    )

    # 질문과 출력을 사용하여 점수를 생성하는 점수화 함수
    @weave.op()
    async def context_precision_score(question, output):
        context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
        Output in only valid JSON format.

        question: {question}
        context: {context}
        answer: {answer}
        verdict: """
        client = OpenAI()

        prompt = context_precision_prompt.format(
            question=question,
            context=output['context'],
            answer=output['answer'],
        )

        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
            response_format={ "type": "json_object" }
        )
        response_message = response.choices[0].message
        response = json.loads(response_message.content)
        return {
            "verdict": int(response["verdict"]) == 1,
        }

    questions = [
        {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
        {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
        {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
        {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
        {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
        {"question": "Which company achieved the first U.S. moon landing since 1972?"},
        {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
    ]

    # Evaluation 객체를 정의하고 예시 질문과 점수화 함수를 전달
    evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
    asyncio.run(evaluation.evaluate(model))
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript twoslash lines theme={null}
    // @noErrors
    require('dotenv').config();
    import { OpenAI } from 'openai';
    import * as weave from 'weave';

    interface Article {
        text: string;
        embedding?: number[];
    }

    const articles: Article[] = [
        { 
            text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
        },
        { 
            text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
        },
        { 
            text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
        }
    ];

    function cosineSimilarity(a: number[], b: number[]): number {
        const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
        const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
        const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
        return dotProduct / (magnitudeA * magnitudeB);
    }

    const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
        const openai = new OpenAI();
        const enrichedDocs = await Promise.all(docs.map(async (doc) => {
            const response = await openai.embeddings.create({
                input: doc.text,
                model: "text-embedding-3-small"
            });
            return {
                ...doc,
                embedding: response.data[0].embedding
            };
        }));
        return enrichedDocs;
    });

    class RAGModel {
        private openai: OpenAI;
        private systemMessage: string;
        private modelName: string;
        private articleEmbeddings: Article[];

        constructor(config: {
            systemMessage: string;
            modelName?: string;
            articleEmbeddings: Article[];
        }) {
            this.openai = new OpenAI();
            this.systemMessage = config.systemMessage;
            this.modelName = config.modelName || "gpt-3.5-turbo-1106";
            this.articleEmbeddings = config.articleEmbeddings;
            this.predict = weave.op(this, this.predict);
        }

        private async getMostRelevantDocument(query: string): Promise<string> {
            const queryEmbedding = await this.openai.embeddings.create({
                input: query,
                model: "text-embedding-3-small"
            });

            const similarities = this.articleEmbeddings.map(doc => {
                if (!doc.embedding) return 0;
                return cosineSimilarity(queryEmbedding.data[0].embedding, doc.embedding);
            });

            const mostRelevantIndex = similarities.indexOf(Math.max(...similarities));
            return this.articleEmbeddings[mostRelevantIndex].text;
        }

        async predict(question: string): Promise<{
            answer: string;
            context: string;
        }> {
            const context = await this.getMostRelevantDocument(question);
            
            const response = await this.openai.chat.completions.create({
                model: this.modelName,
                messages: [
                    { role: "system", content: this.systemMessage },
                    { 
                        role: "user", 
                        content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                        Context:
                        """
                        ${context}
                        """
                        Question: ${question}`
                    }
                ],
                temperature: 0
            });

            return {
                answer: response.choices[0].message.content || "",
                context
            };
        }
    }

    interface ScorerResult {
        verdict: boolean;
    }

    interface QuestionRow {
        question: string;
    }

    function createQuestionDataset(): weave.Dataset<QuestionRow> {
        return new weave.Dataset<QuestionRow>({
            id: 'rag-questions',
            rows: [
                { question: "What significant result was reported about Zealand Pharma's obesity trial?" },
                { question: "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?" },
                { question: "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?" }
            ]
        });
    }

    const contextPrecisionScore = weave.op(async function(args: {
        datasetRow: QuestionRow;
        modelOutput: { answer: string; context: string; }
    }): Promise<ScorerResult> {
        const openai = new OpenAI();
        
        const prompt = `Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
        Output in only valid JSON format.

        question: ${args.datasetRow.question}
        context: ${args.modelOutput.context}
        answer: ${args.modelOutput.answer}
        verdict: `;

        const response = await openai.chat.completions.create({
            model: "gpt-4-turbo-preview",
            messages: [{ role: "user", content: prompt }],
            response_format: { type: "json_object" }
        });

        const result = JSON.parse(response.choices[0].message.content || "{}");
        return {
            verdict: parseInt(result.verdict) === 1
        };
    });

    async function main() {
        # 팀 및 프로젝트 이름을 설정하세요
        await weave.init('<team-name>/rag-quickstart');
        
        const articleEmbeddings = await docsToEmbeddings(articles);
        
        const model = new RAGModel({
            systemMessage: "You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source.",
            articleEmbeddings
        });

        const evaluation = new weave.Evaluation({
            dataset: createQuestionDataset(),
            scorers: [contextPrecisionScore]
        });

        const results = await evaluation.evaluate({
            model: weave.op((args: { datasetRow: QuestionRow }) => 
                model.predict(args.datasetRow.question)
            )
        });
        
        console.log('Evaluation results:', results);
    }

    if (require.main === module) {
        main().catch(console.error);
    }
    ```
  </Tab>
</Tabs>

<div id="conclusion">
  ## 결론
</div>

이 튜토리얼에서는 이 예제의 검색 step처럼 애플리케이션의 여러 단계에 관측성을 적용하는 방법을 살펴보았습니다. 또한 애플리케이션 응답을 자동으로 평가하기 위해 LLM 기반 평가자와 같은 더 복잡한 점수 함수를 만드는 방법도 배웠습니다.

<div id="next-steps">
  ## 다음 단계
</div>

엔지니어를 위한 실용적인 RAG 기법을 한층 더 깊이 있게 다루는 [RAG++ 과정](https://www.wandb.courses/courses/rag-in-production?utm_source=wandb_docs\&utm_medium=code\&utm_campaign=weave_docs)을 확인해 보세요. 이 과정에서는 W\&B, Cohere, Weaviate가 제공하는 프로덕션 환경에 바로 적용할 수 있는 솔루션을 통해 성능을 최적화하고, 비용을 절감하며, 애플리케이션의 정확도와 관련성을 높이는 방법을 배울 수 있습니다.