> ## Documentation Index > Fetch the complete documentation index at: https://wb-21fd5541-docs-sandboxes-integrations-placement.mintlify.site/llms.txt > Use this file to discover all available pages before exploring further. # W&B Weave 및 W&B Tables로 모델 평가하기 > W&B Weave와 W&B Tables를 사용해 머신 러닝 모델을 평가하는 방법을 알아보세요.

## Weave로 모델 평가하기

[W\&B Weave](/ko/weave)는 LLM 및 GenAI 애플리케이션을 평가하기 위해 특별히 설계된 툴킷입니다. Scorer, 평가자, 세부 트레이싱을 포함한 포괄적인 평가 기능을 제공하여 모델 성능을 이해하고 개선할 수 있도록 돕습니다. Weave는 W\&B Models와 통합되어 있어 모델 레지스트리에 저장된 모델을 평가할 수 있습니다. 모델 성능 메트릭과 트레이스를 보여주는 Weave 평가 대시보드

### 모델 평가의 주요 기능

* **Scorer와 평가자**: 정확성, 관련성, 일관성 등을 평가하기 위한 사전 제공 및 맞춤형 평가 메트릭 * **평가 데이터셋**: 체계적인 평가를 위한 정답이 포함된 구조화된 테스트 세트 * **모델 버전 관리**: 모델의 여러 버전을 추적하고 비교 * **상세 트레이싱**: 전체 입력/출력 트레이스로 모델 동작을 디버그 * **비용 추적**: 평가 전반에서 API 비용과 토큰 사용량을 모니터링

### 시작하기: W\&B Registry에서 모델 평가하기

W\&B Models Registry에서 모델을 다운로드한 다음 Weave로 평가하세요: ```python theme={null} import weave import wandb from typing import Any # Weave 초기화 weave.init("your-entity/your-project") # W&B 레지스트리에서 로드하는 ChatModel 정의 class ChatModel(weave.Model): model_name: str def model_post_init(self, __context): # W&B Models Registry에서 모델 다운로드 with wandb.init(project="your-project", job_type="model_download") as run: artifact = run.use_artifact(self.model_name) self.model_path = artifact.download() # 여기서 모델을 초기화하세요 @weave.op() async def predict(self, query: str) -> str: # 모델 추론 로직 return self.model.generate(query) # 평가 데이터셋 생성 dataset = weave.Dataset(name="eval_dataset", rows=[ {"input": "What is the capital of France?", "expected": "Paris"}, {"input": "What is 2+2?", "expected": "4"}, ]) # Scorer 정의 @weave.op() def exact_match_scorer(expected: str, output: str) -> dict: return {"correct": expected.lower() == output.lower()} # 평가 실행 model = ChatModel(model_name="wandb-entity/registry-name/model:version") evaluation = weave.Evaluation( dataset=dataset, scorers=[exact_match_scorer] ) results = await evaluation.evaluate(model) ```

### W\&B Models와 Weave 평가 통합하기

[Models and Weave Integration Demo](/ko/weave/cookbooks/Models_and_Weave_Integration_Demo)에서는 다음과 같은 전체 워크플로를 보여줍니다: 1. **레지스트리에서 모델 로드**: W\&B Models 레지스트리에 저장된 파인튜닝된 모델을 다운로드합니다 2. **평가 파이프라인 생성**: 맞춤형 Scorer를 사용해 포괄적인 평가를 구성합니다 3. **결과를 W\&B에 다시 로깅**: 평가 메트릭을 모델 run에 연결합니다 4. **평가된 모델 버전 관리**: 개선된 모델을 레지스트리에 다시 저장합니다 평가 결과를 Weave와 W\&B Models 모두에 로깅합니다: ```python theme={null} # W&B tracking으로 evaluation 실행 with weave.attributes({"wandb-run-id": wandb.run.id}): summary, call = await evaluation.evaluate.call(evaluation, model) # W&B Models에 메트릭 기록 wandb.run.log(summary) wandb.run.config.update({ "weave_eval_url": f"https://wandb.ai/{entity}/{project}/r/call/{call.id}" }) ```

### Weave의 고급 특성

#### 맞춤형 Scorer와 평가자

사용 사례에 맞는 정교한 평가 메트릭을 만드세요: ```python theme={null} @weave.op() def llm_judge_scorer(expected: str, output: str, judge_model) -> dict: prompt = f"Is this answer correct? Expected: {expected}, Got: {output}" judgment = await judge_model.predict(prompt) return {"judge_score": judgment} ```

#### 일괄 평가

여러 모델 버전 또는 설정을 평가해 보세요: ```python theme={null} models = [ ChatModel(model_name="model:v1"), ChatModel(model_name="model:v2"), ] for model in models: results = await evaluation.evaluate(model) print(f"{model.model_name}: {results}") ```

### 다음 단계

* [Weave 평가 튜토리얼 완료하기](/ko/weave/tutorial-eval/) * [Models 및 Weave 인테그레이션 예제](/ko/weave/cookbooks/Models_and_Weave_Integration_Demo)

## 테이블을 사용해 모델 평가하기

W\&B Tables를 사용하면 다음을 할 수 있습니다. * **모델 예측 비교**: 동일한 테스트 세트에서 여러 모델의 성능을 나란히 비교해 볼 수 있습니다 * **예측 변화 추적**: 트레이닝 에포크 또는 모델 버전에 따라 예측이 어떻게 달라지는지 모니터링합니다 * **오류 분석**: 필터와 쿼리를 사용해 자주 오분류되는 예시와 오류 패턴을 찾습니다 * **리치 미디어 시각화**: 이미지, Audio, 텍스트 및 기타 미디어 유형을 예측 및 메트릭과 함께 표시합니다 정답 레이블과 함께 모델 출력이 표시된 예측 테이블 예시

### 기본 예시: 평가 결과 로깅하기

```python theme={null} import wandb # run 초기화 run = wandb.init(project="model-evaluation") # 평가 결과가 담긴 테이블 생성 columns = ["id", "input", "ground_truth", "prediction", "confidence", "correct"] eval_table = wandb.Table(columns=columns) # 평가 데이터 추가 for idx, (input_data, label) in enumerate(test_dataset): prediction = model(input_data) confidence = prediction.max() predicted_class = prediction.argmax() eval_table.add_data( idx, wandb.Image(input_data), # 이미지 또는 기타 미디어 로깅 label, predicted_class, confidence, label == predicted_class ) # 테이블 로깅 run.log({"evaluation_results": eval_table}) ```

### 고급 테이블 워크플로

#### 여러 모델 비교

서로 다른 모델의 평가 테이블을 같은 키에 로깅해 직접 비교합니다: ```python theme={null} # 모델 A 평가 with wandb.init(project="model-comparison", name="model_a") as run: eval_table_a = create_eval_table(model_a, test_data) run.log({"test_predictions": eval_table_a}) # 모델 B 평가 with wandb.init(project="model-comparison", name="model_b") as run: eval_table_b = create_eval_table(model_b, test_data) run.log({"test_predictions": eval_table_b}) ``` 트레이닝 에포크별 모델 예측 나란히 비교

#### 시간 경과에 따른 예측 추적

각 트레이닝 에포크마다 테이블을 로깅해 성능 향상을 시각화합니다: ```python theme={null} for epoch in range(num_epochs): train_model(model, train_data) # 이 에포크의 예측을 평가하고 기록합니다 eval_table = wandb.Table(columns=["image", "truth", "prediction"]) for image, label in test_subset: pred = model(image) eval_table.add_data(wandb.Image(image), label, pred.argmax()) wandb.log({f"predictions_epoch_{epoch}": eval_table}) ```

### W\&B UI에서 대화형 분석

로깅이 완료되면 다음 작업을 수행할 수 있습니다. 1. **결과 필터링**: 열 헤더를 클릭해 예측 정확도, 신뢰도 임계값 또는 특정 클래스 기준으로 필터링합니다 2. **테이블 비교**: 여러 테이블 버전을 선택해 나란히 비교합니다 3. **데이터 쿼리**: 쿼리 바를 사용해 특정 패턴을 찾습니다(예: `"correct" = false AND "confidence" > 0.8`) 4. **그룹화 및 집계**: 예측 클래스별로 그룹화해 클래스별 정확도 메트릭을 확인합니다 W&B Tables에서 평가 결과를 대화형으로 필터링하고 쿼리하는 모습

### 예시: 정보가 보강된 테이블을 사용한 오류 분석

```python theme={null} # 분석 열을 추가하기 위한 가변 테이블 생성 eval_table = wandb.Table( columns=["id", "image", "label", "prediction"], log_mode="MUTABLE" # 나중에 열 추가 허용 ) # 초기 예측 for idx, (img, label) in enumerate(test_data): pred = model(img) eval_table.add_data(idx, wandb.Image(img), label, pred.argmax()) run.log({"eval_analysis": eval_table}) # 오류 분석을 위한 신뢰도 점수 추가 confidences = [model(img).max() for img, _ in test_data] eval_table.add_column("confidence", confidences) # 오류 유형 추가 error_types = classify_errors(eval_table.get_column("label"), eval_table.get_column("prediction")) eval_table.add_column("error_type", error_types) run.log({"eval_analysis": eval_table}) ```