老胡茶室

Debugging: LangChain.js + gemini-embedding-001 returns all-empty vectors, causing the pgvector error "vector must have at least 1 dimension"

冯宇

Symptoms

While using the index function from LangChain.js (@langchain/core/indexing) to embed documents and write them into pgvector, we hit this error:

error: vector must have at least 1 dimension
...
routine: 'vector_in'

A simplified version of the failing code:

import { PostgresRecordManager } from "@langchain/community/indexes/postgres";
import { PGVectorStore } from "@langchain/community/vectorstores/pgvector";
import type { Document } from "@langchain/core/documents";
import type { Embeddings } from "@langchain/core/embeddings";
import { index as langchainIndex } from "@langchain/core/indexing";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

const embeddings = new GoogleGenerativeAIEmbeddings({
  model: "gemini-embedding-001",
});

async function contentIndex(
  documents: Document[],
  myMetadata?: Record<string, unknown>
) {
  const namespace = myMetadata?.knowledgeId
    ? `knowledgeId-${myMetadata.knowledgeId as string}`
    : documents[0].metadata.source;
  const recordManager = new PostgresRecordManager(namespace, {
    postgresConnectionOptions: {
      connectionString: process.env.DATABASE_URL,
    },
  });
  await recordManager.createSchema();
  const vectorStore = await PGVectorStore.initialize(embeddings, {
    postgresConnectionOptions: {
      connectionString: process.env.DATABASE_URL,
    },
    tableName: "documents",
    columns: {
      contentColumnName: "content",
    },
    dimensions: Number(process.env.EMBEDDING_DIMENSIONS),
  });
  const result = await langchainIndex({
    docsSource: documents,
    recordManager,
    vectorStore,
    options: {
      cleanup: "full",
      sourceIdKey: "source",
    },
  });
  await vectorStore.end();
  await recordManager.end();
  return result;
}

The error is thrown from LangChain.js's index function, with no further detail. Worse, pgvector keeps failing no matter how many times we retry.

On the surface this is a database error, but the stack trace gives no clue as to which document/chunk is at fault.

Troubleshooting

First, bypass the index function and call embeddings.embedDocuments() directly:

const vectors = await embeddings.embedDocuments(
  documents.map((doc) => doc.pageContent)
);
console.log(vectors);

The printed output looked like this:

[
  [], [], [], [], [], [], [], [],
  ... // many more empty vectors
]

This proves that something is already failing at the Gemini embedding API level. Reading the GoogleGenerativeAIEmbeddings source, its batch-embedding logic looks like this:

const batchEmbedRequests = batchEmbedChunks.map((chunk) => ({
  requests: chunk.map((doc) => this._convertToContent(doc)),
}));

const responses = await Promise.allSettled(
  batchEmbedRequests.map((req) => this.client.batchEmbedContents(req))
);

const embeddings = responses.flatMap((res, idx) => {
  if (res.status === "fulfilled") {
    return res.value.embeddings.map((e) => e.values || []);
  } else {
    return Array(batchEmbedChunks[idx].length).fill([]);
  }
});

return embeddings;

This code fires the API requests in parallel with Promise.allSettled, but never handles the rejected cases: the exceptions are swallowed, and fill then pads the result with empty vectors, which is exactly what blows up the later database insert.
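Until that changes upstream, the only place to catch the failure is between embedding and insertion. A minimal defensive sketch (the helper name is ours, not LangChain's) that fails fast on empty vectors before they ever reach pgvector:

```typescript
// Return the indexes of zero-length vectors so the caller can throw a
// descriptive error instead of passing them on to pgvector, which rejects
// any vector with 0 dimensions.
function findEmptyVectorIndexes(vectors: number[][]): number[] {
  return vectors.flatMap((v, i) => (v.length === 0 ? [i] : []));
}

const vectors: number[][] = [[0.1, 0.2], [], [0.3]];
const bad = findEmptyVectorIndexes(vectors);
if (bad.length > 0) {
  console.error(`embedding returned empty vectors at indexes: ${bad.join(", ")}`);
}
```

Throwing at this point keeps the bad batch out of the database and makes the failure retryable at the call site.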

To find out what the underlying error actually was, we added a "monkey patch" to log the exceptions:

function attachGenAIDebug(embeddings: any) {
  const client = embeddings?.client;
  if (!client) return;

  for (const method of ["batchEmbedContents", "embedContent"] as const) {
    if (typeof client[method] !== "function") continue;

    const orig = client[method].bind(client);
    client[method] = async (req: any) => {
      const meta =
        method === "batchEmbedContents"
          ? {
              requests: req?.requests?.length,
              totalChars: (req?.requests ?? []).reduce(
                (s: number, r: any) =>
                  s + (r?.content?.parts?.[0]?.text?.length ?? 0),
                0
              ),
            }
          : { chars: req?.content?.parts?.[0]?.text?.length ?? 0 };

      try {
        const res = await orig(req);
        const dims =
          res?.embedding?.values?.length ??
          res?.embeddings?.[0]?.values?.length ??
          0;
        console.error(`[genai] ${method} OK`, meta, { dims });
        return res;
      } catch (err) {
        console.error(`[genai] ${method} FAIL`, meta, err);
        throw err;
      }
    };
  }
}

Then use the patch to intercept the errors:

const embeddings = new GoogleGenerativeAIEmbeddings({
  model: "gemini-embedding-001",
});
attachGenAIDebug(embeddings);
const vectors = await embeddings.embedDocuments(
  documents.map((doc) => doc.pageContent)
);
// ...

This time we finally saw the real error hiding behind the symptom:

[genai] batchEmbedContents FAIL { requests: 100, totalChars: 49929 } GoogleGenerativeAIFetchError: [GoogleGenerativeAI Error]: Error fetching from https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents: [429 Too Many Requests] You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit.
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 100, model: gemini-embedding-1.0
Please retry in 20.088593563s. [{"@type":"type.googleapis.com/google.rpc.Help","links":[{"description":"Learn more about Gemini API quotas","url":"https://ai.google.dev/gemini-api/docs/rate-limits"}]},{"@type":"type.googleapis.com/google.rpc.QuotaFailure","violations":[{"quotaMetric":"generativelanguage.googleapis.com/embed_content_free_tier_requests","quotaId":"EmbedContentRequestsPerMinutePerUserPerProjectPerModel-FreeTier","quotaDimensions":{"location":"global","model":"gemini-embedding-1.0"},"quotaValue":"100"}]},{"@type":"type.googleapis.com/google.rpc.RetryInfo","retryDelay":"20s"}]
    at handleResponseNotOk
{
  status: 429,
  statusText: 'Too Many Requests',
  errorDetails: [
    { '@type': 'type.googleapis.com/google.rpc.Help', links: [Array] },
    {
      '@type': 'type.googleapis.com/google.rpc.QuotaFailure',
      violations: [Array]
    },
    {
      '@type': 'type.googleapis.com/google.rpc.RetryInfo',
      retryDelay: '20s'
    }
  ]
}
[genai] batchEmbedContents OK { requests: 42, totalChars: 25856 } { dims: 3072 }
[
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [], [], [], [], [], [], [], [], [],
  [], [], [], [],
  ... 42 more items
]

Now everything is clear: the Gemini embedding API's free-tier rate limit is 100 requests per minute, so any call rate above 100/min triggers a 429 rate-limit error.
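Since the limit is per minute, one mitigation is to pace our own calls with a small sliding-window limiter. The sketch below is our own helper, not anything LangChain provides; production code should additionally honor the RetryInfo delay returned with 429 responses:

```typescript
// Minimal request pacer: allows at most `limit` calls per `windowMs`,
// sleeping until the oldest recorded call falls out of the window.
class RatePacer {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  async wait(): Promise<void> {
    const now = Date.now();
    // Drop timestamps that are outside the sliding window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) {
      const sleepMs = this.windowMs - (now - this.timestamps[0]);
      await new Promise((r) => setTimeout(r, sleepMs));
    }
    this.timestamps.push(Date.now());
  }
}
```

Calling `await pacer.wait()` before each batchEmbedContents request keeps the call rate under the quota.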

Solution

With the cause known, we can fix it properly. First, the maxRetries option does nothing here: as we saw, the wrapper swallows every exception at the bottom, so nothing propagates up and AsyncCaller's retry mechanism is never triggered.

So the only thing left is to reduce the size of each embedding batch.

The GoogleGenerativeAIEmbeddings source defines a maxBatchSize of 100, hard-coded with no way to override it, so the only thing we can control is the batch size of our own embedDocuments calls.

If you call embeddings.embedDocuments(documents) directly, you can split the documents yourself and call embedDocuments batch by batch.
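A minimal sketch of that manual batching; the chunkArray and embedInBatches helpers are our own, and the batch size of 10 and 1-second pause are illustrative values, not anything the API documents:

```typescript
// Split an array into fixed-size chunks.
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Embed texts in small batches with a pause between calls. `embedFn`
// stands in for embeddings.embedDocuments.bind(embeddings).
async function embedInBatches(
  texts: string[],
  embedFn: (batch: string[]) => Promise<number[][]>,
  batchSize = 10,
  pauseMs = 1000
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of chunkArray(texts, batchSize)) {
    out.push(...(await embedFn(batch)));
    await new Promise((r) => setTimeout(r, pauseMs));
  }
  return out;
}
```

Keeping each call well under the 100-request limit, plus a pause, stays within the per-minute quota for typical document counts.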

As for the index function, it turns out to provide a batchSize option that controls the document batch size internally:

import { index as langchainIndex } from "@langchain/core/indexing";

const result = await langchainIndex({
  docsSource: documents,
  recordManager,
  vectorStore,
  options: {
    cleanup: "full",
    sourceIdKey: "source",
    batchSize: 10,
  },
});

Calling too fast still produces the error: vector must have at least 1 dimension error, but now external retries can make progress: with repeated retries, all documents eventually get embedded instead of being stuck forever.
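That external retry loop can be as simple as the following sketch. withRetries is our own helper, and the attempt count and 20-second delay are illustrative (the delay mirrors the RetryInfo hint seen in the 429 response above), not documented values:

```typescript
// Retry an async operation a fixed number of times with a flat delay
// between attempts, rethrowing the last error if all attempts fail.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  delayMs = 20_000
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < maxAttempts) {
        await new Promise((r) => setTimeout(r, delayMs));
      }
    }
  }
  throw lastErr;
}
```

Wrapping the indexing call, e.g. `await withRetries(() => contentIndex(documents))`, lets the 429-induced failures drain off over successive minutes.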

Summary

LangChain.js's wrapper around the Gemini embedding API is fairly simplistic, and its rate-limit handling is incomplete. To use Gemini embeddings reliably, you have to control the batch size yourself and set up sensible retry rules.
