============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-7.3.1, pluggy-1.3.0
rootdir: /Users/krrishdholakia/Documents/litellm/litellm/tests
plugins: timeout-2.2.0, asyncio-0.23.2, anyio-3.7.1, xdist-3.3.1
asyncio: mode=Mode.STRICT
collected 2 items

test_streaming.py .Token Counter - using hugging face token counter, for model=llama-3-8b-instruct
Looking up model=llama-3-8b-instruct in model_cost_map
F                                                     [100%]

=================================== FAILURES ===================================
__________________ test_completion_predibase_streaming[True] ___________________

model = 'llama-3-8b-instruct'
messages = [{'content': 'What is the meaning of life?', 'role': 'user'}]
timeout = 600.0, temperature = None, top_p = None, n = None, stream = True
stream_options = None, stop = None, max_tokens = None, presence_penalty = None
frequency_penalty = None, logit_bias = None, user = None, response_format = None
seed = None, tools = None, tool_choice = None, logprobs = None
top_logprobs = None, deployment_id = None, extra_headers = None
functions = None, function_call = None, base_url = None, api_version = None
api_key = 'pb_Qg9YbQo7UqqHdu0ozxN_aw', model_list = None
kwargs = {'api_base': 'https://serving.app.predibase.com', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, 'tenant_id': 'c4768f95'}
args = {'acompletion': False, 'api_base': None, 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'api_version': None, ...}
api_base = None, mock_response = None, force_timeout = 600, logger_fn = None
verbose = False, custom_llm_provider = 'predibase'

    @client
    def completion(
        model: str,
        # Optional OpenAI params: see https://platform.openai.com/docs/api-reference/chat/create
        messages: List = [],
        timeout: Optional[Union[float, str, httpx.Timeout]] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        n: Optional[int] = None,
        stream: Optional[bool] = None,
        stream_options: Optional[dict] = None,
        stop=None,
        max_tokens: Optional[int] = None,
        presence_penalty: Optional[float] = None,
        frequency_penalty: Optional[float] = None,
        logit_bias: Optional[dict] = None,
        user: Optional[str] = None,
        # openai v1.0+ new params
        response_format: Optional[dict] = None,
        seed: Optional[int] = None,
        tools: Optional[List] = None,
        tool_choice: Optional[str] = None,
        logprobs: Optional[bool] = None,
        top_logprobs: Optional[int] = None,
        deployment_id=None,
        extra_headers: Optional[dict] = None,
        # soon to be deprecated params by OpenAI
        functions: Optional[List] = None,
        function_call: Optional[str] = None,
        # set api_base, api_version, api_key
        base_url: Optional[str] = None,
        api_version: Optional[str] = None,
        api_key: Optional[str] = None,
        model_list: Optional[list] = None,  # pass in a list of api_base,keys, etc.
        # Optional liteLLM function params
        **kwargs,
    ) -> Union[ModelResponse, CustomStreamWrapper]:
        """
        Perform a completion() using any of litellm supported llms (example gpt-4, gpt-3.5-turbo, claude-2, command-nightly)
        Parameters:
            model (str): The name of the language model to use for text completion. see all supported LLMs: https://docs.litellm.ai/docs/providers/
            messages (List): A list of message objects representing the conversation context (default is an empty list).
    
            OPTIONAL PARAMS
            functions (List, optional): A list of functions to apply to the conversation messages (default is an empty list).
            function_call (str, optional): The name of the function to call within the conversation (default is an empty string).
            temperature (float, optional): The temperature parameter for controlling the randomness of the output (default is 1.0).
            top_p (float, optional): The top-p parameter for nucleus sampling (default is 1.0).
            n (int, optional): The number of completions to generate (default is 1).
            stream (bool, optional): If True, return a streaming response (default is False).
            stream_options (dict, optional): A dictionary containing options for the streaming response. Only set this when you set stream: true.
            stop(string/list, optional): - Up to 4 sequences where the LLM API will stop generating further tokens.
            max_tokens (integer, optional): The maximum number of tokens in the generated completion (default is infinity).
            presence_penalty (float, optional): It is used to penalize new tokens based on their existence in the text so far.
            frequency_penalty: It is used to penalize new tokens based on their frequency in the text so far.
            logit_bias (dict, optional): Used to modify the probability of specific tokens appearing in the completion.
            user (str, optional):  A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse.
            logprobs (bool, optional): Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message
            top_logprobs (int, optional): An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
            metadata (dict, optional): Pass in additional metadata to tag your completion calls - eg. prompt version, details, etc.
            api_base (str, optional): Base URL for the API (default is None).
            api_version (str, optional): API version (default is None).
            api_key (str, optional): API key (default is None).
            model_list (list, optional): List of api base, version, keys
            extra_headers (dict, optional): Additional headers to include in the request.
    
            LITELLM Specific Params
            mock_response (str, optional): If provided, return a mock completion response for testing or debugging purposes (default is None).
            custom_llm_provider (str, optional): Used for Non-OpenAI LLMs, Example usage for bedrock, set model="amazon.titan-tg1-large" and custom_llm_provider="bedrock"
            max_retries (int, optional): The number of retries to attempt (default is 0).
        Returns:
            ModelResponse: A response object containing the generated completion and associated metadata.
    
        Note:
            - This function is used to perform completions() using the specified language model.
            - It supports various optional parameters for customizing the completion behavior.
            - If 'mock_response' is provided, a mock completion response is returned for testing or debugging.
        """
        ######### unpacking kwargs #####################
        args = locals()
        api_base = kwargs.get("api_base", None)
        mock_response = kwargs.get("mock_response", None)
        force_timeout = kwargs.get("force_timeout", 600)  ## deprecated
        logger_fn = kwargs.get("logger_fn", None)
        verbose = kwargs.get("verbose", False)
        custom_llm_provider = kwargs.get("custom_llm_provider", None)
        litellm_logging_obj = kwargs.get("litellm_logging_obj", None)
        id = kwargs.get("id", None)
        metadata = kwargs.get("metadata", None)
        model_info = kwargs.get("model_info", None)
        proxy_server_request = kwargs.get("proxy_server_request", None)
        fallbacks = kwargs.get("fallbacks", None)
        headers = kwargs.get("headers", None)
        num_retries = kwargs.get("num_retries", None)  ## deprecated
        max_retries = kwargs.get("max_retries", None)
        context_window_fallback_dict = kwargs.get("context_window_fallback_dict", None)
        organization = kwargs.get("organization", None)
        ### CUSTOM MODEL COST ###
        input_cost_per_token = kwargs.get("input_cost_per_token", None)
        output_cost_per_token = kwargs.get("output_cost_per_token", None)
        input_cost_per_second = kwargs.get("input_cost_per_second", None)
        output_cost_per_second = kwargs.get("output_cost_per_second", None)
        ### CUSTOM PROMPT TEMPLATE ###
        initial_prompt_value = kwargs.get("initial_prompt_value", None)
        roles = kwargs.get("roles", None)
        final_prompt_value = kwargs.get("final_prompt_value", None)
        bos_token = kwargs.get("bos_token", None)
        eos_token = kwargs.get("eos_token", None)
        preset_cache_key = kwargs.get("preset_cache_key", None)
        hf_model_name = kwargs.get("hf_model_name", None)
        supports_system_message = kwargs.get("supports_system_message", None)
        ### TEXT COMPLETION CALLS ###
        text_completion = kwargs.get("text_completion", False)
        atext_completion = kwargs.get("atext_completion", False)
        ### ASYNC CALLS ###
        acompletion = kwargs.get("acompletion", False)
        client = kwargs.get("client", None)
        ### Admin Controls ###
        no_log = kwargs.get("no-log", False)
        ######## end of unpacking kwargs ###########
        openai_params = [
            "functions",
            "function_call",
            "temperature",
            "temperature",
            "top_p",
            "n",
            "stream",
            "stream_options",
            "stop",
            "max_tokens",
            "presence_penalty",
            "frequency_penalty",
            "logit_bias",
            "user",
            "request_timeout",
            "api_base",
            "api_version",
            "api_key",
            "deployment_id",
            "organization",
            "base_url",
            "default_headers",
            "timeout",
            "response_format",
            "seed",
            "tools",
            "tool_choice",
            "max_retries",
            "logprobs",
            "top_logprobs",
            "extra_headers",
        ]
        litellm_params = [
            "metadata",
            "acompletion",
            "atext_completion",
            "text_completion",
            "caching",
            "mock_response",
            "api_key",
            "api_version",
            "api_base",
            "force_timeout",
            "logger_fn",
            "verbose",
            "custom_llm_provider",
            "litellm_logging_obj",
            "litellm_call_id",
            "use_client",
            "id",
            "fallbacks",
            "azure",
            "headers",
            "model_list",
            "num_retries",
            "context_window_fallback_dict",
            "retry_policy",
            "roles",
            "final_prompt_value",
            "bos_token",
            "eos_token",
            "request_timeout",
            "complete_response",
            "self",
            "client",
            "rpm",
            "tpm",
            "max_parallel_requests",
            "input_cost_per_token",
            "output_cost_per_token",
            "input_cost_per_second",
            "output_cost_per_second",
            "hf_model_name",
            "model_info",
            "proxy_server_request",
            "preset_cache_key",
            "caching_groups",
            "ttl",
            "cache",
            "no-log",
            "base_model",
            "stream_timeout",
            "supports_system_message",
            "region_name",
            "allowed_model_region",
        ]
        default_params = openai_params + litellm_params
        non_default_params = {
            k: v for k, v in kwargs.items() if k not in default_params
        }  # model-specific params - pass them straight to the model/provider
    
        ### TIMEOUT LOGIC ###
        timeout = timeout or kwargs.get("request_timeout", 600) or 600
        # set timeout for 10 minutes by default
    
        if (
            timeout is not None
            and isinstance(timeout, httpx.Timeout)
            and supports_httpx_timeout(custom_llm_provider) == False
        ):
            read_timeout = timeout.read or 600
            timeout = read_timeout  # default 10 min timeout
        elif timeout is not None and not isinstance(timeout, httpx.Timeout):
            timeout = float(timeout)  # type: ignore
    
        try:
            if base_url is not None:
                api_base = base_url
            if max_retries is not None:  # openai allows openai.OpenAI(max_retries=3)
                num_retries = max_retries
            logging = litellm_logging_obj
            fallbacks = fallbacks or litellm.model_fallbacks
            if fallbacks is not None:
                return completion_with_fallbacks(**args)
            if model_list is not None:
                deployments = [
                    m["litellm_params"] for m in model_list if m["model_name"] == model
                ]
                return batch_completion_models(deployments=deployments, **args)
            if litellm.model_alias_map and model in litellm.model_alias_map:
                model = litellm.model_alias_map[
                    model
                ]  # update the model to the actual value if an alias has been passed in
            model_response = ModelResponse()
            setattr(model_response, "usage", litellm.Usage())
            if (
                kwargs.get("azure", False) == True
            ):  # don't remove flag check, to remain backwards compatible for repos like Codium
                custom_llm_provider = "azure"
            if deployment_id != None:  # azure llms
                model = deployment_id
                custom_llm_provider = "azure"
            model, custom_llm_provider, dynamic_api_key, api_base = get_llm_provider(
                model=model,
                custom_llm_provider=custom_llm_provider,
                api_base=api_base,
                api_key=api_key,
            )
            if model_response is not None and hasattr(model_response, "_hidden_params"):
                model_response._hidden_params["custom_llm_provider"] = custom_llm_provider
                model_response._hidden_params["region_name"] = kwargs.get(
                    "aws_region_name", None
                )  # support region-based pricing for bedrock
    
            ### REGISTER CUSTOM MODEL PRICING -- IF GIVEN ###
            if input_cost_per_token is not None and output_cost_per_token is not None:
                print_verbose(f"Registering model={model} in model cost map")
                litellm.register_model(
                    {
                        f"{custom_llm_provider}/{model}": {
                            "input_cost_per_token": input_cost_per_token,
                            "output_cost_per_token": output_cost_per_token,
                            "litellm_provider": custom_llm_provider,
                        },
                        model: {
                            "input_cost_per_token": input_cost_per_token,
                            "output_cost_per_token": output_cost_per_token,
                            "litellm_provider": custom_llm_provider,
                        },
                    }
                )
            elif (
                input_cost_per_second is not None
            ):  # time based pricing just needs cost in place
                output_cost_per_second = output_cost_per_second
                litellm.register_model(
                    {
                        f"{custom_llm_provider}/{model}": {
                            "input_cost_per_second": input_cost_per_second,
                            "output_cost_per_second": output_cost_per_second,
                            "litellm_provider": custom_llm_provider,
                        },
                        model: {
                            "input_cost_per_second": input_cost_per_second,
                            "output_cost_per_second": output_cost_per_second,
                            "litellm_provider": custom_llm_provider,
                        },
                    }
                )
            ### BUILD CUSTOM PROMPT TEMPLATE -- IF GIVEN ###
            custom_prompt_dict = {}  # type: ignore
            if (
                initial_prompt_value
                or roles
                or final_prompt_value
                or bos_token
                or eos_token
            ):
                custom_prompt_dict = {model: {}}
                if initial_prompt_value:
                    custom_prompt_dict[model]["initial_prompt_value"] = initial_prompt_value
                if roles:
                    custom_prompt_dict[model]["roles"] = roles
                if final_prompt_value:
                    custom_prompt_dict[model]["final_prompt_value"] = final_prompt_value
                if bos_token:
                    custom_prompt_dict[model]["bos_token"] = bos_token
                if eos_token:
                    custom_prompt_dict[model]["eos_token"] = eos_token
    
            if (
                supports_system_message is not None
                and isinstance(supports_system_message, bool)
                and supports_system_message == False
            ):
                messages = map_system_message_pt(messages=messages)
            model_api_key = get_api_key(
                llm_provider=custom_llm_provider, dynamic_api_key=api_key
            )  # get the api key from the environment if required for the model
    
            if dynamic_api_key is not None:
                api_key = dynamic_api_key
            # check if user passed in any of the OpenAI optional params
            optional_params = get_optional_params(
                functions=functions,
                function_call=function_call,
                temperature=temperature,
                top_p=top_p,
                n=n,
                stream=stream,
                stream_options=stream_options,
                stop=stop,
                max_tokens=max_tokens,
                presence_penalty=presence_penalty,
                frequency_penalty=frequency_penalty,
                logit_bias=logit_bias,
                user=user,
                # params to identify the model
                model=model,
                custom_llm_provider=custom_llm_provider,
                response_format=response_format,
                seed=seed,
                tools=tools,
                tool_choice=tool_choice,
                max_retries=max_retries,
                logprobs=logprobs,
                top_logprobs=top_logprobs,
                extra_headers=extra_headers,
                **non_default_params,
            )
    
            if litellm.add_function_to_prompt and optional_params.get(
                "functions_unsupported_model", None
            ):  # if user opts to add it to prompt, when API doesn't support function calling
                functions_unsupported_model = optional_params.pop(
                    "functions_unsupported_model"
                )
                messages = function_call_prompt(
                    messages=messages, functions=functions_unsupported_model
                )
    
            # For logging - save the values of the litellm-specific params passed in
            litellm_params = get_litellm_params(
                acompletion=acompletion,
                api_key=api_key,
                force_timeout=force_timeout,
                logger_fn=logger_fn,
                verbose=verbose,
                custom_llm_provider=custom_llm_provider,
                api_base=api_base,
                litellm_call_id=kwargs.get("litellm_call_id", None),
                model_alias_map=litellm.model_alias_map,
                completion_call_id=id,
                metadata=metadata,
                model_info=model_info,
                proxy_server_request=proxy_server_request,
                preset_cache_key=preset_cache_key,
                no_log=no_log,
            )
            logging.update_environment_variables(
                model=model,
                user=user,
                optional_params=optional_params,
                litellm_params=litellm_params,
            )
            if mock_response:
                return mock_completion(
                    model,
                    messages,
                    stream=stream,
                    mock_response=mock_response,
                    logging=logging,
                    acompletion=acompletion,
                )
            if custom_llm_provider == "azure":
                # azure configs
                api_type = get_secret("AZURE_API_TYPE") or "azure"
    
                api_base = api_base or litellm.api_base or get_secret("AZURE_API_BASE")
    
                api_version = (
                    api_version or litellm.api_version or get_secret("AZURE_API_VERSION")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.azure_key
                    or get_secret("AZURE_OPENAI_API_KEY")
                    or get_secret("AZURE_API_KEY")
                )
    
                azure_ad_token = optional_params.get("extra_body", {}).pop(
                    "azure_ad_token", None
                ) or get_secret("AZURE_AD_TOKEN")
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.AzureOpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > azure_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                response = azure_chat_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    api_version=api_version,
                    api_type=api_type,
                    azure_ad_token=azure_ad_token,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,  # type: ignore
                    client=client,  # pass AsyncAzureOpenAI, AzureOpenAI client
                )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={
                            "headers": headers,
                            "api_version": api_version,
                            "api_base": api_base,
                        },
                    )
            elif custom_llm_provider == "azure_text":
                # azure configs
                api_type = get_secret("AZURE_API_TYPE") or "azure"
    
                api_base = api_base or litellm.api_base or get_secret("AZURE_API_BASE")
    
                api_version = (
                    api_version or litellm.api_version or get_secret("AZURE_API_VERSION")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.azure_key
                    or get_secret("AZURE_OPENAI_API_KEY")
                    or get_secret("AZURE_API_KEY")
                )
    
                azure_ad_token = optional_params.get("extra_body", {}).pop(
                    "azure_ad_token", None
                ) or get_secret("AZURE_AD_TOKEN")
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.AzureOpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > azure_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                response = azure_text_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    api_version=api_version,
                    api_type=api_type,
                    azure_ad_token=azure_ad_token,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,
                    client=client,  # pass AsyncAzureOpenAI, AzureOpenAI client
                )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={
                            "headers": headers,
                            "api_version": api_version,
                            "api_base": api_base,
                        },
                    )
            elif (
                model in litellm.open_ai_chat_completion_models
                or custom_llm_provider == "custom_openai"
                or custom_llm_provider == "deepinfra"
                or custom_llm_provider == "perplexity"
                or custom_llm_provider == "groq"
                or custom_llm_provider == "deepseek"
                or custom_llm_provider == "anyscale"
                or custom_llm_provider == "mistral"
                or custom_llm_provider == "openai"
                or custom_llm_provider == "together_ai"
                or custom_llm_provider in litellm.openai_compatible_providers
                or "ft:gpt-3.5-turbo" in model  # finetune gpt-3.5-turbo
            ):  # allow user to make an openai call with a custom base
                # note: if a user sets a custom base - we should ensure this works
                # allow for the setting of dynamic and stateful api-bases
                api_base = (
                    api_base  # for deepinfra/perplexity/anyscale/groq we check in get_llm_provider and pass in the api base from there
                    or litellm.api_base
                    or get_secret("OPENAI_API_BASE")
                    or "https://api.openai.com/v1"
                )
                openai.organization = (
                    organization
                    or litellm.organization
                    or get_secret("OPENAI_ORGANIZATION")
                    or None  # default - https://github.com/openai/openai-python/blob/284c1799070c723c6a553337134148a7ab088dd8/openai/util.py#L105
                )
                # set API KEY
                api_key = (
                    api_key
                    or litellm.api_key  # for deepinfra/perplexity/anyscale we check in get_llm_provider and pass in the api key from there
                    or litellm.openai_key
                    or get_secret("OPENAI_API_KEY")
                )
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.OpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > openai_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                try:
                    response = openai_chat_completions.completion(
                        model=model,
                        messages=messages,
                        headers=headers,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        api_key=api_key,
                        api_base=api_base,
                        acompletion=acompletion,
                        logging_obj=logging,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        timeout=timeout,  # type: ignore
                        custom_prompt_dict=custom_prompt_dict,
                        client=client,  # pass AsyncOpenAI, OpenAI client
                        organization=organization,
                        custom_llm_provider=custom_llm_provider,
                    )
                except Exception as e:
                    ## LOGGING - log the original exception returned
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=str(e),
                        additional_args={"headers": headers},
                    )
                    raise e
    
                if optional_params.get("stream", False):
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={"headers": headers},
                    )
            elif (
                custom_llm_provider == "text-completion-openai"
                or "ft:babbage-002" in model
                or "ft:davinci-002" in model  # support for finetuned completion models
            ):
                openai.api_type = "openai"
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("OPENAI_API_BASE")
                    or "https://api.openai.com/v1"
                )
    
                openai.api_version = None
                # set API KEY
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.openai_key
                    or get_secret("OPENAI_API_KEY")
                )
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.OpenAITextCompletionConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > openai_text_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
                if litellm.organization:
                    openai.organization = litellm.organization
    
                if (
                    len(messages) > 0
                    and "content" in messages[0]
                    and type(messages[0]["content"]) == list
                ):
                    # text-davinci-003 can accept a string or array, if it's an array, assume the array is set in messages[0]['content']
                    # https://platform.openai.com/docs/api-reference/completions/create
                    prompt = messages[0]["content"]
                else:
                    prompt = " ".join([message["content"] for message in messages])  # type: ignore
    
                ## COMPLETION CALL
                _response = openai_text_completions.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    api_key=api_key,
                    api_base=api_base,
                    acompletion=acompletion,
                    client=client,  # pass AsyncOpenAI, OpenAI client
                    logging_obj=logging,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    timeout=timeout,  # type: ignore
                )
    
                if (
                    optional_params.get("stream", False) == False
                    and acompletion == False
                    and text_completion == False
                ):
                    # convert to chat completion response
                    _response = litellm.OpenAITextCompletionConfig().convert_to_chat_model_response_object(
                        response_object=_response, model_response_object=model_response
                    )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=_response,
                        additional_args={"headers": headers},
                    )
                response = _response
            elif (
                "replicate" in model
                or custom_llm_provider == "replicate"
                or model in litellm.replicate_models
            ):
                # Setting the relevant API KEY for replicate, replicate defaults to using os.environ.get("REPLICATE_API_TOKEN")
                replicate_key = None
                replicate_key = (
                    api_key
                    or litellm.replicate_key
                    or litellm.api_key
                    or get_secret("REPLICATE_API_KEY")
                    or get_secret("REPLICATE_API_TOKEN")
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("REPLICATE_API_BASE")
                    or "https://api.replicate.com/v1"
                )
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                model_response = replicate.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,  # for calculating input/output tokens
                    api_key=replicate_key,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    model_response = CustomStreamWrapper(model_response, model, logging_obj=logging, custom_llm_provider="replicate")  # type: ignore
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=replicate_key,
                        original_response=model_response,
                    )
    
                response = model_response
    
            elif custom_llm_provider == "anthropic":
                api_key = (
                    api_key
                    or litellm.anthropic_key
                    or litellm.api_key
                    or os.environ.get("ANTHROPIC_API_KEY")
                )
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                if (model == "claude-2") or (model == "claude-instant-1"):
                    # call anthropic /completion, only use this route for claude-2, claude-instant-1
                    api_base = (
                        api_base
                        or litellm.api_base
                        or get_secret("ANTHROPIC_API_BASE")
                        or "https://api.anthropic.com/v1/complete"
                    )
                    response = anthropic_text_completions.completion(
                        model=model,
                        messages=messages,
                        api_base=api_base,
                        acompletion=acompletion,
                        custom_prompt_dict=litellm.custom_prompt_dict,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,  # for calculating input/output tokens
                        api_key=api_key,
                        logging_obj=logging,
                        headers=headers,
                    )
                else:
                    # call /messages
                    # default route for all anthropic models
                    api_base = (
                        api_base
                        or litellm.api_base
                        or get_secret("ANTHROPIC_API_BASE")
                        or "https://api.anthropic.com/v1/messages"
                    )
                    response = anthropic_chat_completions.completion(
                        model=model,
                        messages=messages,
                        api_base=api_base,
                        acompletion=acompletion,
                        custom_prompt_dict=litellm.custom_prompt_dict,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,  # for calculating input/output tokens
                        api_key=api_key,
                        logging_obj=logging,
                        headers=headers,
                    )
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                    )
                response = response
            elif custom_llm_provider == "nlp_cloud":
                nlp_cloud_key = (
                    api_key
                    or litellm.nlp_cloud_key
                    or get_secret("NLP_CLOUD_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("NLP_CLOUD_API_BASE")
                    or "https://api.nlpcloud.io/v1/gpu/"
                )
    
                response = nlp_cloud.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=nlp_cloud_key,
                    logging_obj=logging,
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        response,
                        model,
                        custom_llm_provider="nlp_cloud",
                        logging_obj=logging,
                    )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                    )
    
                response = response
            elif custom_llm_provider == "aleph_alpha":
                aleph_alpha_key = (
                    api_key
                    or litellm.aleph_alpha_key
                    or get_secret("ALEPH_ALPHA_API_KEY")
                    or get_secret("ALEPHALPHA_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("ALEPH_ALPHA_API_BASE")
                    or "https://api.aleph-alpha.com/complete"
                )
    
                model_response = aleph_alpha.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    default_max_tokens_to_sample=litellm.max_tokens,
                    api_key=aleph_alpha_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="aleph_alpha",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "cohere":
                cohere_key = (
                    api_key
                    or litellm.cohere_key
                    or get_secret("COHERE_API_KEY")
                    or get_secret("CO_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("COHERE_API_BASE")
                    or "https://api.cohere.ai/v1/generate"
                )
    
                model_response = cohere.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=cohere_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="cohere",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "cohere_chat":
                cohere_key = (
                    api_key
                    or litellm.cohere_key
                    or get_secret("COHERE_API_KEY")
                    or get_secret("CO_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("COHERE_API_BASE")
                    or "https://api.cohere.ai/v1/chat"
                )
    
                model_response = cohere_chat.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=cohere_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="cohere_chat",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "maritalk":
                maritalk_key = (
                    api_key
                    or litellm.maritalk_key
                    or get_secret("MARITALK_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("MARITALK_API_BASE")
                    or "https://chat.maritaca.ai/api/chat/inference"
                )
    
                model_response = maritalk.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=maritalk_key,
                    logging_obj=logging,
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="maritalk",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "huggingface":
                custom_llm_provider = "huggingface"
                huggingface_key = (
                    api_key
                    or litellm.huggingface_key
                    or os.environ.get("HF_TOKEN")
                    or os.environ.get("HUGGINGFACE_API_KEY")
                    or litellm.api_key
                )
                hf_headers = headers or litellm.headers
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                model_response = huggingface.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,  # type: ignore
                    headers=hf_headers,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=huggingface_key,
                    acompletion=acompletion,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                    timeout=timeout,  # type: ignore
                )
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion is False
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="huggingface",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "oobabooga":
                custom_llm_provider = "oobabooga"
                model_response = oobabooga.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    api_base=api_base,  # type: ignore
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    api_key=None,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                )
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="oobabooga",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "openrouter":
                api_base = api_base or litellm.api_base or "https://openrouter.ai/api/v1"
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.openrouter_key
                    or get_secret("OPENROUTER_API_KEY")
                    or get_secret("OR_API_KEY")
                )
    
                openrouter_site_url = get_secret("OR_SITE_URL") or "https://litellm.ai"
    
                openrouter_app_name = get_secret("OR_APP_NAME") or "liteLLM"
    
                headers = (
                    headers
                    or litellm.headers
                    or {
                        "HTTP-Referer": openrouter_site_url,
                        "X-Title": openrouter_app_name,
                    }
                )
    
                ## Load Config
                config = openrouter.OpenrouterConfig.get_config()
                for k, v in config.items():
                    if k == "extra_body":
                        # we use openai 'extra_body' to pass openrouter specific params - transforms, route, models
                        if "extra_body" in optional_params:
                            optional_params[k].update(v)
                        else:
                            optional_params[k] = v
                    elif k not in optional_params:
                        optional_params[k] = v
    
                data = {"model": model, "messages": messages, **optional_params}
    
                ## COMPLETION CALL
                response = openai_chat_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,  # type: ignore
                )
                ## LOGGING
                logging.post_call(
                    input=messages, api_key=openai.api_key, original_response=response
                )
            elif (
                custom_llm_provider == "together_ai"
                or ("togethercomputer" in model)
                or (model in litellm.together_ai_models)
            ):
                """
                Deprecated. We now do together ai calls via the openai client - https://docs.together.ai/docs/openai-api-compatibility
                """
                custom_llm_provider = "together_ai"
                together_ai_key = (
                    api_key
                    or litellm.togetherai_api_key
                    or get_secret("TOGETHER_AI_TOKEN")
                    or get_secret("TOGETHERAI_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("TOGETHERAI_API_BASE")
                    or "https://api.together.xyz/inference"
                )
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                model_response = together_ai.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=together_ai_key,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if (
                    "stream_tokens" in optional_params
                    and optional_params["stream_tokens"] == True
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="together_ai",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "palm":
                palm_api_key = api_key or get_secret("PALM_API_KEY") or litellm.api_key
    
                # palm does not support streaming as yet :(
                model_response = palm.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=palm_api_key,
                    logging_obj=logging,
                )
                # fake palm streaming
                if "stream" in optional_params and optional_params["stream"] == True:
                    # fake streaming for palm
                    resp_string = model_response["choices"][0]["message"]["content"]
                    response = CustomStreamWrapper(
                        resp_string, model, custom_llm_provider="palm", logging_obj=logging
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "gemini":
                gemini_api_key = (
                    api_key
                    or get_secret("GEMINI_API_KEY")
                    or get_secret("PALM_API_KEY")  # older palm api key should also work
                    or litellm.api_key
                )
    
                # palm does not support streaming as yet :(
                model_response = gemini.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=gemini_api_key,
                    logging_obj=logging,
                    acompletion=acompletion,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion == False
                ):
                    response = CustomStreamWrapper(
                        iter(model_response),
                        model,
                        custom_llm_provider="gemini",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "vertex_ai":
                vertex_ai_project = (
                    optional_params.pop("vertex_project", None)
                    or optional_params.pop("vertex_ai_project", None)
                    or litellm.vertex_project
                    or get_secret("VERTEXAI_PROJECT")
                )
                vertex_ai_location = (
                    optional_params.pop("vertex_location", None)
                    or optional_params.pop("vertex_ai_location", None)
                    or litellm.vertex_location
                    or get_secret("VERTEXAI_LOCATION")
                )
                vertex_credentials = (
                    optional_params.pop("vertex_credentials", None)
                    or optional_params.pop("vertex_ai_credentials", None)
                    or get_secret("VERTEXAI_CREDENTIALS")
                )
                new_params = deepcopy(optional_params)
                if "claude-3" in model:
                    model_response = vertex_ai_anthropic.completion(
                        model=model,
                        messages=messages,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=new_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,
                        vertex_location=vertex_ai_location,
                        vertex_project=vertex_ai_project,
                        vertex_credentials=vertex_credentials,
                        logging_obj=logging,
                        acompletion=acompletion,
                    )
                else:
                    model_response = vertex_ai.completion(
                        model=model,
                        messages=messages,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=new_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,
                        vertex_location=vertex_ai_location,
                        vertex_project=vertex_ai_project,
                        vertex_credentials=vertex_credentials,
                        logging_obj=logging,
                        acompletion=acompletion,
                    )
    
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion == False
                ):
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="vertex_ai",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "predibase":
                tenant_id = (
                    optional_params.pop("tenant_id", None)
                    or optional_params.pop("predibase_tenant_id", None)
                    or litellm.predibase_tenant_id
                    or get_secret("PREDIBASE_TENANT_ID")
                )
    
                api_base = (
                    optional_params.pop("api_base", None)
                    or optional_params.pop("base_url", None)
                    or litellm.api_base
                    or get_secret("PREDIBASE_API_BASE")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.predibase_key
                    or get_secret("PREDIBASE_API_KEY")
                )
    
>               model_response = predibase_chat_completions.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                    acompletion=acompletion,
                    api_base=api_base,
                    custom_prompt_dict=custom_prompt_dict,
                    api_key=api_key,
                    tenant_id=tenant_id,
                )

../main.py:1813: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <litellm.llms.predibase.PredibaseChatCompletion object at 0x108b4e990>
model = 'llama-3-8b-instruct'
messages = [{'content': 'What is the meaning of life?', 'role': 'user'}]
api_base = None, custom_prompt_dict = {}
model_response = ModelResponse(id='chatcmpl-755fcb98-22ba-46a2-9d6d-1a85b4363e98', choices=[Choices(finish_reason='stop', index=0, mess... role='assistant'))], created=1715301477, model=None, object='chat.completion', system_fingerprint=None, usage=Usage())
print_verbose = <function print_verbose at 0x108a5bf60>
encoding = <Encoding 'cl100k_base'>, api_key = 'pb_Qg9YbQo7UqqHdu0ozxN_aw'
logging_obj = <litellm.utils.Logging object at 0x10a5cbf50>
optional_params = {'details': True, 'max_new_tokens': 256, 'return_full_text': False}
tenant_id = 'c4768f95', acompletion = False
litellm_params = {'acompletion': False, 'api_base': 'https://serving.app.predibase.com/c4768f95/deployments/v2/llms/llama-3-8b-instruct/generate_stream', 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'completion_call_id': None, ...}
logger_fn = None
headers = {'Authorization': 'Bearer pb_Qg9YbQo7UqqHdu0ozxN_aw', 'content-type': 'application/json'}

    def completion(
        self,
        model: str,
        messages: list,
        api_base: str,
        custom_prompt_dict: dict,
        model_response: ModelResponse,
        print_verbose: Callable,
        encoding,
        api_key: str,
        logging_obj,
        optional_params: dict,
        tenant_id: str,
        acompletion=None,
        litellm_params=None,
        logger_fn=None,
        headers: dict = {},
    ) -> Union[ModelResponse, CustomStreamWrapper]:
        headers = self.validate_environment(api_key, headers)
        completion_url = ""
        input_text = ""
        base_url = "https://serving.app.predibase.com"
        if "https" in model:
            completion_url = model
        elif api_base:
            base_url = api_base
        elif "PREDIBASE_API_BASE" in os.environ:
            base_url = os.getenv("PREDIBASE_API_BASE", "")
    
        completion_url = f"{base_url}/{tenant_id}/deployments/v2/llms/{model}"
    
        if optional_params.get("stream", False) == True:
            completion_url += "/generate_stream"
        else:
            completion_url += "/generate"
    
        if model in custom_prompt_dict:
            # check if the model has a registered custom prompt
            model_prompt_details = custom_prompt_dict[model]
            prompt = custom_prompt(
                role_dict=model_prompt_details["roles"],
                initial_prompt_value=model_prompt_details["initial_prompt_value"],
                final_prompt_value=model_prompt_details["final_prompt_value"],
                messages=messages,
            )
        else:
            prompt = prompt_factory(model=model, messages=messages)
    
        ## Load Config
        config = litellm.PredibaseConfig.get_config()
        for k, v in config.items():
            if (
                k not in optional_params
            ):  # completion(top_k=3) > anthropic_config(top_k=3) <- allows for dynamic variables to be passed in
                optional_params[k] = v
    
        stream = optional_params.pop("stream", False)
    
        data = {
            "inputs": prompt,
            "parameters": optional_params,
        }
        input_text = prompt
        ## LOGGING
        logging_obj.pre_call(
            input=input_text,
            api_key=api_key,
            additional_args={
                "complete_input_dict": data,
                "headers": headers,
                "api_base": completion_url,
                "acompletion": acompletion,
            },
        )
        ## COMPLETION CALL
        if acompletion is True:
            ### ASYNC STREAMING
            if stream == True:
                return self.async_streaming(
                    model=model,
                    messages=messages,
                    data=data,
                    api_base=completion_url,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    encoding=encoding,
                    api_key=api_key,
                    logging_obj=logging_obj,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    headers=headers,
                )  # type: ignore
            else:
                ### ASYNC COMPLETION
                return self.async_completion(
                    model=model,
                    messages=messages,
                    data=data,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    encoding=encoding,
                    api_key=api_key,
                    logging_obj=logging_obj,
                    optional_params=optional_params,
                    stream=False,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    headers=headers,
                )  # type: ignore
    
        ### SYNC STREAMING
        if stream == True:
            response = requests.post(
                completion_url,
                headers=headers,
                data=json.dumps(data),
>               stream=optional_params["stream"],
            )
E           KeyError: 'stream'

../llms/predibase.py:412: KeyError

During handling of the above exception, another exception occurred:

sync_mode = True

    @pytest.mark.parametrize("sync_mode", [True, False])
    @pytest.mark.asyncio
    async def test_completion_predibase_streaming(sync_mode):
        try:
            litellm.set_verbose = True
    
            if sync_mode:
>               response = completion(
                    model="predibase/llama-3-8b-instruct",
                    tenant_id="c4768f95",
                    api_base="https://serving.app.predibase.com",
                    api_key=os.getenv("PREDIBASE_API_KEY"),
                    messages=[{"role": "user", "content": "What is the meaning of life?"}],
                    stream=True,
                )

test_streaming.py:317: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ()
kwargs = {'api_base': 'https://serving.app.predibase.com', 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, ...}
result = None, start_time = datetime.datetime(2024, 5, 9, 17, 37, 57, 884661)
logging_obj = <litellm.utils.Logging object at 0x10a5cbf50>
call_type = 'completion', model = 'predibase/llama-3-8b-instruct'
k = 'litellm_logging_obj'

    @wraps(original_function)
    def wrapper(*args, **kwargs):
        # DO NOT MOVE THIS. It always needs to run first
        # Check if this is an async function. If so only execute the async function
        if (
            kwargs.get("acompletion", False) == True
            or kwargs.get("aembedding", False) == True
            or kwargs.get("aimg_generation", False) == True
            or kwargs.get("amoderation", False) == True
            or kwargs.get("atext_completion", False) == True
            or kwargs.get("atranscription", False) == True
        ):
            # [OPTIONAL] CHECK MAX RETRIES / REQUEST
            if litellm.num_retries_per_request is not None:
                # check if previous_models passed in as ['litellm_params']['metadata]['previous_models']
                previous_models = kwargs.get("metadata", {}).get(
                    "previous_models", None
                )
                if previous_models is not None:
                    if litellm.num_retries_per_request <= len(previous_models):
                        raise Exception(f"Max retries per request hit!")
    
            # MODEL CALL
            result = original_function(*args, **kwargs)
            if "stream" in kwargs and kwargs["stream"] == True:
                if (
                    "complete_response" in kwargs
                    and kwargs["complete_response"] == True
                ):
                    chunks = []
                    for idx, chunk in enumerate(result):
                        chunks.append(chunk)
                    return litellm.stream_chunk_builder(
                        chunks, messages=kwargs.get("messages", None)
                    )
                else:
                    return result
            return result
    
        # Prints Exactly what was passed to litellm function - don't execute any logic here - it should just print
        print_args_passed_to_litellm(original_function, args, kwargs)
        start_time = datetime.datetime.now()
        result = None
        logging_obj = kwargs.get("litellm_logging_obj", None)
    
        # only set litellm_call_id if its not in kwargs
        call_type = original_function.__name__
        if "litellm_call_id" not in kwargs:
            kwargs["litellm_call_id"] = str(uuid.uuid4())
        try:
            model = args[0] if len(args) > 0 else kwargs["model"]
        except:
            model = None
            if (
                call_type != CallTypes.image_generation.value
                and call_type != CallTypes.text_completion.value
            ):
                raise ValueError("model param not passed in.")
    
        try:
            if logging_obj is None:
                logging_obj, kwargs = function_setup(
                    original_function.__name__, rules_obj, start_time, *args, **kwargs
                )
            kwargs["litellm_logging_obj"] = logging_obj
    
            # CHECK FOR 'os.environ/' in kwargs
            for k, v in kwargs.items():
                if v is not None and isinstance(v, str) and v.startswith("os.environ/"):
                    kwargs[k] = litellm.get_secret(v)
            # [OPTIONAL] CHECK BUDGET
            if litellm.max_budget:
                if litellm._current_cost > litellm.max_budget:
                    raise BudgetExceededError(
                        current_cost=litellm._current_cost,
                        max_budget=litellm.max_budget,
                    )
    
            # [OPTIONAL] CHECK MAX RETRIES / REQUEST
            if litellm.num_retries_per_request is not None:
                # check if previous_models passed in as ['litellm_params']['metadata]['previous_models']
                previous_models = kwargs.get("metadata", {}).get(
                    "previous_models", None
                )
                if previous_models is not None:
                    if litellm.num_retries_per_request <= len(previous_models):
                        raise Exception(f"Max retries per request hit!")
    
            # [OPTIONAL] CHECK CACHE
            print_verbose(
                f"SYNC kwargs[caching]: {kwargs.get('caching', False)}; litellm.cache: {litellm.cache}; kwargs.get('cache')['no-cache']: {kwargs.get('cache', {}).get('no-cache', False)}"
            )
            # if caching is false or cache["no-cache"]==True, don't run this
            if (
                (
                    (
                        (
                            kwargs.get("caching", None) is None
                            and litellm.cache is not None
                        )
                        or kwargs.get("caching", False) == True
                    )
                    and kwargs.get("cache", {}).get("no-cache", False) != True
                )
                and kwargs.get("aembedding", False) != True
                and kwargs.get("atext_completion", False) != True
                and kwargs.get("acompletion", False) != True
                and kwargs.get("aimg_generation", False) != True
                and kwargs.get("atranscription", False) != True
            ):  # allow users to control returning cached responses from the completion function
                # checking cache
                print_verbose(f"INSIDE CHECKING CACHE")
                if (
                    litellm.cache is not None
                    and str(original_function.__name__)
                    in litellm.cache.supported_call_types
                ):
                    print_verbose(f"Checking Cache")
                    preset_cache_key = litellm.cache.get_cache_key(*args, **kwargs)
                    kwargs["preset_cache_key"] = (
                        preset_cache_key  # for streaming calls, we need to pass the preset_cache_key
                    )
                    cached_result = litellm.cache.get_cache(*args, **kwargs)
                    if cached_result != None:
                        if "detail" in cached_result:
                            # implies an error occurred
                            pass
                        else:
                            call_type = original_function.__name__
                            print_verbose(
                                f"Cache Response Object routing: call_type - {call_type}; cached_result instace: {type(cached_result)}"
                            )
                            if call_type == CallTypes.completion.value and isinstance(
                                cached_result, dict
                            ):
                                cached_result = convert_to_model_response_object(
                                    response_object=cached_result,
                                    model_response_object=ModelResponse(),
                                    stream=kwargs.get("stream", False),
                                )
                                if kwargs.get("stream", False) == True:
                                    cached_result = CustomStreamWrapper(
                                        completion_stream=cached_result,
                                        model=model,
                                        custom_llm_provider="cached_response",
                                        logging_obj=logging_obj,
                                    )
                            elif call_type == CallTypes.embedding.value and isinstance(
                                cached_result, dict
                            ):
                                cached_result = convert_to_model_response_object(
                                    response_object=cached_result,
                                    response_type="embedding",
                                )
    
                            # LOG SUCCESS
                            cache_hit = True
                            end_time = datetime.datetime.now()
                            (
                                model,
                                custom_llm_provider,
                                dynamic_api_key,
                                api_base,
                            ) = litellm.get_llm_provider(
                                model=model,
                                custom_llm_provider=kwargs.get(
                                    "custom_llm_provider", None
                                ),
                                api_base=kwargs.get("api_base", None),
                                api_key=kwargs.get("api_key", None),
                            )
                            print_verbose(
                                f"Async Wrapper: Completed Call, calling async_success_handler: {logging_obj.async_success_handler}"
                            )
                            logging_obj.update_environment_variables(
                                model=model,
                                user=kwargs.get("user", None),
                                optional_params={},
                                litellm_params={
                                    "logger_fn": kwargs.get("logger_fn", None),
                                    "acompletion": False,
                                    "metadata": kwargs.get("metadata", {}),
                                    "model_info": kwargs.get("model_info", {}),
                                    "proxy_server_request": kwargs.get(
                                        "proxy_server_request", None
                                    ),
                                    "preset_cache_key": kwargs.get(
                                        "preset_cache_key", None
                                    ),
                                    "stream_response": kwargs.get(
                                        "stream_response", {}
                                    ),
                                },
                                input=kwargs.get("messages", ""),
                                api_key=kwargs.get("api_key", None),
                                original_response=str(cached_result),
                                additional_args=None,
                                stream=kwargs.get("stream", False),
                            )
                            threading.Thread(
                                target=logging_obj.success_handler,
                                args=(cached_result, start_time, end_time, cache_hit),
                            ).start()
                            return cached_result
    
            # CHECK MAX TOKENS
            if (
                kwargs.get("max_tokens", None) is not None
                and model is not None
                and litellm.modify_params
                == True  # user is okay with params being modified
                and (
                    call_type == CallTypes.acompletion.value
                    or call_type == CallTypes.completion.value
                )
            ):
                try:
                    base_model = model
                    if kwargs.get("hf_model_name", None) is not None:
                        base_model = f"huggingface/{kwargs.get('hf_model_name')}"
                    max_output_tokens = (
                        get_max_tokens(model=base_model) or 4096
                    )  # assume min context window is 4k tokens
                    user_max_tokens = kwargs.get("max_tokens")
                    ## Scenario 1: User limit + prompt > model limit
                    messages = None
                    if len(args) > 1:
                        messages = args[1]
                    elif kwargs.get("messages", None):
                        messages = kwargs["messages"]
                    input_tokens = token_counter(model=base_model, messages=messages)
                    input_tokens += max(
                        0.1 * input_tokens, 10
                    )  # give at least a 10 token buffer. token counting can be imprecise.
                    if input_tokens > max_output_tokens:
                        pass  # allow call to fail normally
                    elif user_max_tokens + input_tokens > max_output_tokens:
                        user_max_tokens = max_output_tokens - input_tokens
                    print_verbose(f"user_max_tokens: {user_max_tokens}")
                    kwargs["max_tokens"] = int(
                        round(user_max_tokens)
                    )  # make sure max tokens is always an int
                except Exception as e:
                    print_verbose(f"Error while checking max token limit: {str(e)}")
            # MODEL CALL
            result = original_function(*args, **kwargs)
            end_time = datetime.datetime.now()
            if "stream" in kwargs and kwargs["stream"] == True:
                if (
                    "complete_response" in kwargs
                    and kwargs["complete_response"] == True
                ):
                    chunks = []
                    for idx, chunk in enumerate(result):
                        chunks.append(chunk)
                    return litellm.stream_chunk_builder(
                        chunks, messages=kwargs.get("messages", None)
                    )
                else:
                    return result
            elif "acompletion" in kwargs and kwargs["acompletion"] == True:
                return result
            elif "aembedding" in kwargs and kwargs["aembedding"] == True:
                return result
            elif "aimg_generation" in kwargs and kwargs["aimg_generation"] == True:
                return result
            elif "atranscription" in kwargs and kwargs["atranscription"] == True:
                return result
    
            ### POST-CALL RULES ###
            post_call_processing(original_response=result, model=model or None)
    
            # [OPTIONAL] ADD TO CACHE
            if (
                litellm.cache is not None
                and str(original_function.__name__)
                in litellm.cache.supported_call_types
            ) and (kwargs.get("cache", {}).get("no-store", False) != True):
                litellm.cache.add_cache(result, *args, **kwargs)
    
            # LOG SUCCESS - handle streaming success logging in the _next_ object, remove `handle_success` once it's deprecated
            verbose_logger.info(f"Wrapper: Completed Call, calling success_handler")
            threading.Thread(
                target=logging_obj.success_handler, args=(result, start_time, end_time)
            ).start()
            # RETURN RESULT
            if hasattr(result, "_hidden_params"):
                result._hidden_params["model_id"] = kwargs.get("model_info", {}).get(
                    "id", None
                )
                result._hidden_params["api_base"] = get_api_base(
                    model=model,
                    optional_params=getattr(logging_obj, "optional_params", {}),
                )
            result._response_ms = (
                end_time - start_time
            ).total_seconds() * 1000  # return response latency in ms like openai
            return result
        except Exception as e:
            call_type = original_function.__name__
            if call_type == CallTypes.completion.value:
                num_retries = (
                    kwargs.get("num_retries", None) or litellm.num_retries or None
                )
                litellm.num_retries = (
                    None  # set retries to None to prevent infinite loops
                )
                context_window_fallback_dict = kwargs.get(
                    "context_window_fallback_dict", {}
                )
    
                _is_litellm_router_call = "model_group" in kwargs.get(
                    "metadata", {}
                )  # check if call from litellm.router/proxy
                if (
                    num_retries and not _is_litellm_router_call
                ):  # only enter this if call is not from litellm router/proxy. router has it's own logic for retrying
                    if (
                        isinstance(e, openai.APIError)
                        or isinstance(e, openai.Timeout)
                        or isinstance(e, openai.APIConnectionError)
                    ):
                        kwargs["num_retries"] = num_retries
                        return litellm.completion_with_retries(*args, **kwargs)
                elif (
                    isinstance(e, litellm.exceptions.ContextWindowExceededError)
                    and context_window_fallback_dict
                    and model in context_window_fallback_dict
                ):
                    if len(args) > 0:
                        args[0] = context_window_fallback_dict[model]
                    else:
                        kwargs["model"] = context_window_fallback_dict[model]
                    return original_function(*args, **kwargs)
            traceback_exception = traceback.format_exc()
            end_time = datetime.datetime.now()
            # LOG FAILURE - handle streaming failure logging in the _next_ object, remove `handle_failure` once it's deprecated
            if logging_obj:
                logging_obj.failure_handler(
                    e, traceback_exception, start_time, end_time
                )  # DO NOT MAKE THREADED - router retry fallback relies on this!
                my_thread = threading.Thread(
                    target=handle_failure,
                    args=(e, traceback_exception, start_time, end_time, args, kwargs),
                )  # don't interrupt execution of main thread
                my_thread.start()
                if hasattr(e, "message"):
                    if (
                        liteDebuggerClient and liteDebuggerClient.dashboard_url != None
                    ):  # make it easy to get to the debugger logs if you've initialized it
                        e.message += f"\n Check the log in your dashboard - {liteDebuggerClient.dashboard_url}"
>           raise e

../utils.py:3229: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ()
kwargs = {'api_base': 'https://serving.app.predibase.com', 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, ...}
result = None, start_time = datetime.datetime(2024, 5, 9, 17, 37, 57, 884661)
logging_obj = <litellm.utils.Logging object at 0x10a5cbf50>
call_type = 'completion', model = 'predibase/llama-3-8b-instruct'
k = 'litellm_logging_obj'

    @wraps(original_function)
    def wrapper(*args, **kwargs):
        # DO NOT MOVE THIS. It always needs to run first
        # Check if this is an async function. If so only execute the async function
        if (
            kwargs.get("acompletion", False) == True
            or kwargs.get("aembedding", False) == True
            or kwargs.get("aimg_generation", False) == True
            or kwargs.get("amoderation", False) == True
            or kwargs.get("atext_completion", False) == True
            or kwargs.get("atranscription", False) == True
        ):
            # [OPTIONAL] CHECK MAX RETRIES / REQUEST
            if litellm.num_retries_per_request is not None:
                # check if previous_models passed in as ['litellm_params']['metadata]['previous_models']
                previous_models = kwargs.get("metadata", {}).get(
                    "previous_models", None
                )
                if previous_models is not None:
                    if litellm.num_retries_per_request <= len(previous_models):
                        raise Exception(f"Max retries per request hit!")
    
            # MODEL CALL
            result = original_function(*args, **kwargs)
            if "stream" in kwargs and kwargs["stream"] == True:
                if (
                    "complete_response" in kwargs
                    and kwargs["complete_response"] == True
                ):
                    chunks = []
                    for idx, chunk in enumerate(result):
                        chunks.append(chunk)
                    return litellm.stream_chunk_builder(
                        chunks, messages=kwargs.get("messages", None)
                    )
                else:
                    return result
            return result
    
        # Prints Exactly what was passed to litellm function - don't execute any logic here - it should just print
        print_args_passed_to_litellm(original_function, args, kwargs)
        start_time = datetime.datetime.now()
        result = None
        logging_obj = kwargs.get("litellm_logging_obj", None)
    
        # only set litellm_call_id if its not in kwargs
        call_type = original_function.__name__
        if "litellm_call_id" not in kwargs:
            kwargs["litellm_call_id"] = str(uuid.uuid4())
        try:
            model = args[0] if len(args) > 0 else kwargs["model"]
        except:
            model = None
            if (
                call_type != CallTypes.image_generation.value
                and call_type != CallTypes.text_completion.value
            ):
                raise ValueError("model param not passed in.")
    
        try:
            if logging_obj is None:
                logging_obj, kwargs = function_setup(
                    original_function.__name__, rules_obj, start_time, *args, **kwargs
                )
            kwargs["litellm_logging_obj"] = logging_obj
    
            # CHECK FOR 'os.environ/' in kwargs
            for k, v in kwargs.items():
                if v is not None and isinstance(v, str) and v.startswith("os.environ/"):
                    kwargs[k] = litellm.get_secret(v)
            # [OPTIONAL] CHECK BUDGET
            if litellm.max_budget:
                if litellm._current_cost > litellm.max_budget:
                    raise BudgetExceededError(
                        current_cost=litellm._current_cost,
                        max_budget=litellm.max_budget,
                    )
    
            # [OPTIONAL] CHECK MAX RETRIES / REQUEST
            if litellm.num_retries_per_request is not None:
                # check if previous_models passed in as ['litellm_params']['metadata]['previous_models']
                previous_models = kwargs.get("metadata", {}).get(
                    "previous_models", None
                )
                if previous_models is not None:
                    if litellm.num_retries_per_request <= len(previous_models):
                        raise Exception(f"Max retries per request hit!")
    
            # [OPTIONAL] CHECK CACHE
            print_verbose(
                f"SYNC kwargs[caching]: {kwargs.get('caching', False)}; litellm.cache: {litellm.cache}; kwargs.get('cache')['no-cache']: {kwargs.get('cache', {}).get('no-cache', False)}"
            )
            # if caching is false or cache["no-cache"]==True, don't run this
            if (
                (
                    (
                        (
                            kwargs.get("caching", None) is None
                            and litellm.cache is not None
                        )
                        or kwargs.get("caching", False) == True
                    )
                    and kwargs.get("cache", {}).get("no-cache", False) != True
                )
                and kwargs.get("aembedding", False) != True
                and kwargs.get("atext_completion", False) != True
                and kwargs.get("acompletion", False) != True
                and kwargs.get("aimg_generation", False) != True
                and kwargs.get("atranscription", False) != True
            ):  # allow users to control returning cached responses from the completion function
                # checking cache
                print_verbose(f"INSIDE CHECKING CACHE")
                if (
                    litellm.cache is not None
                    and str(original_function.__name__)
                    in litellm.cache.supported_call_types
                ):
                    print_verbose(f"Checking Cache")
                    preset_cache_key = litellm.cache.get_cache_key(*args, **kwargs)
                    kwargs["preset_cache_key"] = (
                        preset_cache_key  # for streaming calls, we need to pass the preset_cache_key
                    )
                    cached_result = litellm.cache.get_cache(*args, **kwargs)
                    if cached_result != None:
                        if "detail" in cached_result:
                            # implies an error occurred
                            pass
                        else:
                            call_type = original_function.__name__
                            print_verbose(
                                f"Cache Response Object routing: call_type - {call_type}; cached_result instace: {type(cached_result)}"
                            )
                            if call_type == CallTypes.completion.value and isinstance(
                                cached_result, dict
                            ):
                                cached_result = convert_to_model_response_object(
                                    response_object=cached_result,
                                    model_response_object=ModelResponse(),
                                    stream=kwargs.get("stream", False),
                                )
                                if kwargs.get("stream", False) == True:
                                    cached_result = CustomStreamWrapper(
                                        completion_stream=cached_result,
                                        model=model,
                                        custom_llm_provider="cached_response",
                                        logging_obj=logging_obj,
                                    )
                            elif call_type == CallTypes.embedding.value and isinstance(
                                cached_result, dict
                            ):
                                cached_result = convert_to_model_response_object(
                                    response_object=cached_result,
                                    response_type="embedding",
                                )
    
                            # LOG SUCCESS
                            cache_hit = True
                            end_time = datetime.datetime.now()
                            (
                                model,
                                custom_llm_provider,
                                dynamic_api_key,
                                api_base,
                            ) = litellm.get_llm_provider(
                                model=model,
                                custom_llm_provider=kwargs.get(
                                    "custom_llm_provider", None
                                ),
                                api_base=kwargs.get("api_base", None),
                                api_key=kwargs.get("api_key", None),
                            )
                            print_verbose(
                                f"Async Wrapper: Completed Call, calling async_success_handler: {logging_obj.async_success_handler}"
                            )
                            logging_obj.update_environment_variables(
                                model=model,
                                user=kwargs.get("user", None),
                                optional_params={},
                                litellm_params={
                                    "logger_fn": kwargs.get("logger_fn", None),
                                    "acompletion": False,
                                    "metadata": kwargs.get("metadata", {}),
                                    "model_info": kwargs.get("model_info", {}),
                                    "proxy_server_request": kwargs.get(
                                        "proxy_server_request", None
                                    ),
                                    "preset_cache_key": kwargs.get(
                                        "preset_cache_key", None
                                    ),
                                    "stream_response": kwargs.get(
                                        "stream_response", {}
                                    ),
                                },
                                input=kwargs.get("messages", ""),
                                api_key=kwargs.get("api_key", None),
                                original_response=str(cached_result),
                                additional_args=None,
                                stream=kwargs.get("stream", False),
                            )
                            threading.Thread(
                                target=logging_obj.success_handler,
                                args=(cached_result, start_time, end_time, cache_hit),
                            ).start()
                            return cached_result
    
            # CHECK MAX TOKENS
            if (
                kwargs.get("max_tokens", None) is not None
                and model is not None
                and litellm.modify_params
                == True  # user is okay with params being modified
                and (
                    call_type == CallTypes.acompletion.value
                    or call_type == CallTypes.completion.value
                )
            ):
                try:
                    base_model = model
                    if kwargs.get("hf_model_name", None) is not None:
                        base_model = f"huggingface/{kwargs.get('hf_model_name')}"
                    max_output_tokens = (
                        get_max_tokens(model=base_model) or 4096
                    )  # assume min context window is 4k tokens
                    user_max_tokens = kwargs.get("max_tokens")
                    ## Scenario 1: User limit + prompt > model limit
                    messages = None
                    if len(args) > 1:
                        messages = args[1]
                    elif kwargs.get("messages", None):
                        messages = kwargs["messages"]
                    input_tokens = token_counter(model=base_model, messages=messages)
                    input_tokens += max(
                        0.1 * input_tokens, 10
                    )  # give at least a 10 token buffer. token counting can be imprecise.
                    if input_tokens > max_output_tokens:
                        pass  # allow call to fail normally
                    elif user_max_tokens + input_tokens > max_output_tokens:
                        user_max_tokens = max_output_tokens - input_tokens
                    print_verbose(f"user_max_tokens: {user_max_tokens}")
                    kwargs["max_tokens"] = int(
                        round(user_max_tokens)
                    )  # make sure max tokens is always an int
                except Exception as e:
                    print_verbose(f"Error while checking max token limit: {str(e)}")
            # MODEL CALL
>           result = original_function(*args, **kwargs)

../utils.py:3123: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

model = 'llama-3-8b-instruct'
messages = [{'content': 'What is the meaning of life?', 'role': 'user'}]
timeout = 600.0, temperature = None, top_p = None, n = None, stream = True
stream_options = None, stop = None, max_tokens = None, presence_penalty = None
frequency_penalty = None, logit_bias = None, user = None, response_format = None
seed = None, tools = None, tool_choice = None, logprobs = None
top_logprobs = None, deployment_id = None, extra_headers = None
functions = None, function_call = None, base_url = None, api_version = None
api_key = 'pb_Qg9YbQo7UqqHdu0ozxN_aw', model_list = None
kwargs = {'api_base': 'https://serving.app.predibase.com', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, 'tenant_id': 'c4768f95'}
args = {'acompletion': False, 'api_base': None, 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'api_version': None, ...}
api_base = None, mock_response = None, force_timeout = 600, logger_fn = None
verbose = False, custom_llm_provider = 'predibase'

    @client
    def completion(
        model: str,
        # Optional OpenAI params: see https://platform.openai.com/docs/api-reference/chat/create
        messages: List = [],
        timeout: Optional[Union[float, str, httpx.Timeout]] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        n: Optional[int] = None,
        stream: Optional[bool] = None,
        stream_options: Optional[dict] = None,
        stop=None,
        max_tokens: Optional[int] = None,
        presence_penalty: Optional[float] = None,
        frequency_penalty: Optional[float] = None,
        logit_bias: Optional[dict] = None,
        user: Optional[str] = None,
        # openai v1.0+ new params
        response_format: Optional[dict] = None,
        seed: Optional[int] = None,
        tools: Optional[List] = None,
        tool_choice: Optional[str] = None,
        logprobs: Optional[bool] = None,
        top_logprobs: Optional[int] = None,
        deployment_id=None,
        extra_headers: Optional[dict] = None,
        # soon to be deprecated params by OpenAI
        functions: Optional[List] = None,
        function_call: Optional[str] = None,
        # set api_base, api_version, api_key
        base_url: Optional[str] = None,
        api_version: Optional[str] = None,
        api_key: Optional[str] = None,
        model_list: Optional[list] = None,  # pass in a list of api_base,keys, etc.
        # Optional liteLLM function params
        **kwargs,
    ) -> Union[ModelResponse, CustomStreamWrapper]:
        """
        Perform a completion() using any of litellm supported llms (example gpt-4, gpt-3.5-turbo, claude-2, command-nightly)
        Parameters:
            model (str): The name of the language model to use for text completion. see all supported LLMs: https://docs.litellm.ai/docs/providers/
            messages (List): A list of message objects representing the conversation context (default is an empty list).
    
            OPTIONAL PARAMS
            functions (List, optional): A list of functions to apply to the conversation messages (default is an empty list).
            function_call (str, optional): The name of the function to call within the conversation (default is an empty string).
            temperature (float, optional): The temperature parameter for controlling the randomness of the output (default is 1.0).
            top_p (float, optional): The top-p parameter for nucleus sampling (default is 1.0).
            n (int, optional): The number of completions to generate (default is 1).
            stream (bool, optional): If True, return a streaming response (default is False).
            stream_options (dict, optional): A dictionary containing options for the streaming response. Only set this when you set stream: true.
            stop(string/list, optional): - Up to 4 sequences where the LLM API will stop generating further tokens.
            max_tokens (integer, optional): The maximum number of tokens in the generated completion (default is infinity).
            presence_penalty (float, optional): It is used to penalize new tokens based on their existence in the text so far.
            frequency_penalty: It is used to penalize new tokens based on their frequency in the text so far.
            logit_bias (dict, optional): Used to modify the probability of specific tokens appearing in the completion.
            user (str, optional):  A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse.
            logprobs (bool, optional): Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message
            top_logprobs (int, optional): An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
            metadata (dict, optional): Pass in additional metadata to tag your completion calls - eg. prompt version, details, etc.
            api_base (str, optional): Base URL for the API (default is None).
            api_version (str, optional): API version (default is None).
            api_key (str, optional): API key (default is None).
            model_list (list, optional): List of api base, version, keys
            extra_headers (dict, optional): Additional headers to include in the request.
    
            LITELLM Specific Params
            mock_response (str, optional): If provided, return a mock completion response for testing or debugging purposes (default is None).
            custom_llm_provider (str, optional): Used for Non-OpenAI LLMs, Example usage for bedrock, set model="amazon.titan-tg1-large" and custom_llm_provider="bedrock"
            max_retries (int, optional): The number of retries to attempt (default is 0).
        Returns:
            ModelResponse: A response object containing the generated completion and associated metadata.
    
        Note:
            - This function is used to perform completions() using the specified language model.
            - It supports various optional parameters for customizing the completion behavior.
            - If 'mock_response' is provided, a mock completion response is returned for testing or debugging.
        """
        ######### unpacking kwargs #####################
        args = locals()
        api_base = kwargs.get("api_base", None)
        mock_response = kwargs.get("mock_response", None)
        force_timeout = kwargs.get("force_timeout", 600)  ## deprecated
        logger_fn = kwargs.get("logger_fn", None)
        verbose = kwargs.get("verbose", False)
        custom_llm_provider = kwargs.get("custom_llm_provider", None)
        litellm_logging_obj = kwargs.get("litellm_logging_obj", None)
        id = kwargs.get("id", None)
        metadata = kwargs.get("metadata", None)
        model_info = kwargs.get("model_info", None)
        proxy_server_request = kwargs.get("proxy_server_request", None)
        fallbacks = kwargs.get("fallbacks", None)
        headers = kwargs.get("headers", None)
        num_retries = kwargs.get("num_retries", None)  ## deprecated
        max_retries = kwargs.get("max_retries", None)
        context_window_fallback_dict = kwargs.get("context_window_fallback_dict", None)
        organization = kwargs.get("organization", None)
        ### CUSTOM MODEL COST ###
        input_cost_per_token = kwargs.get("input_cost_per_token", None)
        output_cost_per_token = kwargs.get("output_cost_per_token", None)
        input_cost_per_second = kwargs.get("input_cost_per_second", None)
        output_cost_per_second = kwargs.get("output_cost_per_second", None)
        ### CUSTOM PROMPT TEMPLATE ###
        initial_prompt_value = kwargs.get("initial_prompt_value", None)
        roles = kwargs.get("roles", None)
        final_prompt_value = kwargs.get("final_prompt_value", None)
        bos_token = kwargs.get("bos_token", None)
        eos_token = kwargs.get("eos_token", None)
        preset_cache_key = kwargs.get("preset_cache_key", None)
        hf_model_name = kwargs.get("hf_model_name", None)
        supports_system_message = kwargs.get("supports_system_message", None)
        ### TEXT COMPLETION CALLS ###
        text_completion = kwargs.get("text_completion", False)
        atext_completion = kwargs.get("atext_completion", False)
        ### ASYNC CALLS ###
        acompletion = kwargs.get("acompletion", False)
        client = kwargs.get("client", None)
        ### Admin Controls ###
        no_log = kwargs.get("no-log", False)
        ######## end of unpacking kwargs ###########
        openai_params = [
            "functions",
            "function_call",
            "temperature",
            "temperature",
            "top_p",
            "n",
            "stream",
            "stream_options",
            "stop",
            "max_tokens",
            "presence_penalty",
            "frequency_penalty",
            "logit_bias",
            "user",
            "request_timeout",
            "api_base",
            "api_version",
            "api_key",
            "deployment_id",
            "organization",
            "base_url",
            "default_headers",
            "timeout",
            "response_format",
            "seed",
            "tools",
            "tool_choice",
            "max_retries",
            "logprobs",
            "top_logprobs",
            "extra_headers",
        ]
        litellm_params = [
            "metadata",
            "acompletion",
            "atext_completion",
            "text_completion",
            "caching",
            "mock_response",
            "api_key",
            "api_version",
            "api_base",
            "force_timeout",
            "logger_fn",
            "verbose",
            "custom_llm_provider",
            "litellm_logging_obj",
            "litellm_call_id",
            "use_client",
            "id",
            "fallbacks",
            "azure",
            "headers",
            "model_list",
            "num_retries",
            "context_window_fallback_dict",
            "retry_policy",
            "roles",
            "final_prompt_value",
            "bos_token",
            "eos_token",
            "request_timeout",
            "complete_response",
            "self",
            "client",
            "rpm",
            "tpm",
            "max_parallel_requests",
            "input_cost_per_token",
            "output_cost_per_token",
            "input_cost_per_second",
            "output_cost_per_second",
            "hf_model_name",
            "model_info",
            "proxy_server_request",
            "preset_cache_key",
            "caching_groups",
            "ttl",
            "cache",
            "no-log",
            "base_model",
            "stream_timeout",
            "supports_system_message",
            "region_name",
            "allowed_model_region",
        ]
        default_params = openai_params + litellm_params
        non_default_params = {
            k: v for k, v in kwargs.items() if k not in default_params
        }  # model-specific params - pass them straight to the model/provider
    
        ### TIMEOUT LOGIC ###
        timeout = timeout or kwargs.get("request_timeout", 600) or 600
        # set timeout for 10 minutes by default
    
        if (
            timeout is not None
            and isinstance(timeout, httpx.Timeout)
            and supports_httpx_timeout(custom_llm_provider) == False
        ):
            read_timeout = timeout.read or 600
            timeout = read_timeout  # default 10 min timeout
        elif timeout is not None and not isinstance(timeout, httpx.Timeout):
            timeout = float(timeout)  # type: ignore
    
        try:
            if base_url is not None:
                api_base = base_url
            if max_retries is not None:  # openai allows openai.OpenAI(max_retries=3)
                num_retries = max_retries
            logging = litellm_logging_obj
            fallbacks = fallbacks or litellm.model_fallbacks
            if fallbacks is not None:
                return completion_with_fallbacks(**args)
            if model_list is not None:
                deployments = [
                    m["litellm_params"] for m in model_list if m["model_name"] == model
                ]
                return batch_completion_models(deployments=deployments, **args)
            if litellm.model_alias_map and model in litellm.model_alias_map:
                model = litellm.model_alias_map[
                    model
                ]  # update the model to the actual value if an alias has been passed in
            model_response = ModelResponse()
            setattr(model_response, "usage", litellm.Usage())
            if (
                kwargs.get("azure", False) == True
            ):  # don't remove flag check, to remain backwards compatible for repos like Codium
                custom_llm_provider = "azure"
            if deployment_id != None:  # azure llms
                model = deployment_id
                custom_llm_provider = "azure"
            model, custom_llm_provider, dynamic_api_key, api_base = get_llm_provider(
                model=model,
                custom_llm_provider=custom_llm_provider,
                api_base=api_base,
                api_key=api_key,
            )
            if model_response is not None and hasattr(model_response, "_hidden_params"):
                model_response._hidden_params["custom_llm_provider"] = custom_llm_provider
                model_response._hidden_params["region_name"] = kwargs.get(
                    "aws_region_name", None
                )  # support region-based pricing for bedrock
    
            ### REGISTER CUSTOM MODEL PRICING -- IF GIVEN ###
            if input_cost_per_token is not None and output_cost_per_token is not None:
                print_verbose(f"Registering model={model} in model cost map")
                litellm.register_model(
                    {
                        f"{custom_llm_provider}/{model}": {
                            "input_cost_per_token": input_cost_per_token,
                            "output_cost_per_token": output_cost_per_token,
                            "litellm_provider": custom_llm_provider,
                        },
                        model: {
                            "input_cost_per_token": input_cost_per_token,
                            "output_cost_per_token": output_cost_per_token,
                            "litellm_provider": custom_llm_provider,
                        },
                    }
                )
            elif (
                input_cost_per_second is not None
            ):  # time based pricing just needs cost in place
                output_cost_per_second = output_cost_per_second
                litellm.register_model(
                    {
                        f"{custom_llm_provider}/{model}": {
                            "input_cost_per_second": input_cost_per_second,
                            "output_cost_per_second": output_cost_per_second,
                            "litellm_provider": custom_llm_provider,
                        },
                        model: {
                            "input_cost_per_second": input_cost_per_second,
                            "output_cost_per_second": output_cost_per_second,
                            "litellm_provider": custom_llm_provider,
                        },
                    }
                )
            ### BUILD CUSTOM PROMPT TEMPLATE -- IF GIVEN ###
            custom_prompt_dict = {}  # type: ignore
            if (
                initial_prompt_value
                or roles
                or final_prompt_value
                or bos_token
                or eos_token
            ):
                custom_prompt_dict = {model: {}}
                if initial_prompt_value:
                    custom_prompt_dict[model]["initial_prompt_value"] = initial_prompt_value
                if roles:
                    custom_prompt_dict[model]["roles"] = roles
                if final_prompt_value:
                    custom_prompt_dict[model]["final_prompt_value"] = final_prompt_value
                if bos_token:
                    custom_prompt_dict[model]["bos_token"] = bos_token
                if eos_token:
                    custom_prompt_dict[model]["eos_token"] = eos_token
    
            if (
                supports_system_message is not None
                and isinstance(supports_system_message, bool)
                and supports_system_message == False
            ):
                messages = map_system_message_pt(messages=messages)
            model_api_key = get_api_key(
                llm_provider=custom_llm_provider, dynamic_api_key=api_key
            )  # get the api key from the environment if required for the model
    
            if dynamic_api_key is not None:
                api_key = dynamic_api_key
            # check if user passed in any of the OpenAI optional params
            optional_params = get_optional_params(
                functions=functions,
                function_call=function_call,
                temperature=temperature,
                top_p=top_p,
                n=n,
                stream=stream,
                stream_options=stream_options,
                stop=stop,
                max_tokens=max_tokens,
                presence_penalty=presence_penalty,
                frequency_penalty=frequency_penalty,
                logit_bias=logit_bias,
                user=user,
                # params to identify the model
                model=model,
                custom_llm_provider=custom_llm_provider,
                response_format=response_format,
                seed=seed,
                tools=tools,
                tool_choice=tool_choice,
                max_retries=max_retries,
                logprobs=logprobs,
                top_logprobs=top_logprobs,
                extra_headers=extra_headers,
                **non_default_params,
            )
    
            if litellm.add_function_to_prompt and optional_params.get(
                "functions_unsupported_model", None
            ):  # if user opts to add it to prompt, when API doesn't support function calling
                functions_unsupported_model = optional_params.pop(
                    "functions_unsupported_model"
                )
                messages = function_call_prompt(
                    messages=messages, functions=functions_unsupported_model
                )
    
            # For logging - save the values of the litellm-specific params passed in
            litellm_params = get_litellm_params(
                acompletion=acompletion,
                api_key=api_key,
                force_timeout=force_timeout,
                logger_fn=logger_fn,
                verbose=verbose,
                custom_llm_provider=custom_llm_provider,
                api_base=api_base,
                litellm_call_id=kwargs.get("litellm_call_id", None),
                model_alias_map=litellm.model_alias_map,
                completion_call_id=id,
                metadata=metadata,
                model_info=model_info,
                proxy_server_request=proxy_server_request,
                preset_cache_key=preset_cache_key,
                no_log=no_log,
            )
            logging.update_environment_variables(
                model=model,
                user=user,
                optional_params=optional_params,
                litellm_params=litellm_params,
            )
            if mock_response:
                return mock_completion(
                    model,
                    messages,
                    stream=stream,
                    mock_response=mock_response,
                    logging=logging,
                    acompletion=acompletion,
                )
            if custom_llm_provider == "azure":
                # azure configs
                api_type = get_secret("AZURE_API_TYPE") or "azure"
    
                api_base = api_base or litellm.api_base or get_secret("AZURE_API_BASE")
    
                api_version = (
                    api_version or litellm.api_version or get_secret("AZURE_API_VERSION")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.azure_key
                    or get_secret("AZURE_OPENAI_API_KEY")
                    or get_secret("AZURE_API_KEY")
                )
    
                azure_ad_token = optional_params.get("extra_body", {}).pop(
                    "azure_ad_token", None
                ) or get_secret("AZURE_AD_TOKEN")
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.AzureOpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > azure_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                response = azure_chat_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    api_version=api_version,
                    api_type=api_type,
                    azure_ad_token=azure_ad_token,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,  # type: ignore
                    client=client,  # pass AsyncAzureOpenAI, AzureOpenAI client
                )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={
                            "headers": headers,
                            "api_version": api_version,
                            "api_base": api_base,
                        },
                    )
            elif custom_llm_provider == "azure_text":
                # azure configs
                api_type = get_secret("AZURE_API_TYPE") or "azure"
    
                api_base = api_base or litellm.api_base or get_secret("AZURE_API_BASE")
    
                api_version = (
                    api_version or litellm.api_version or get_secret("AZURE_API_VERSION")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.azure_key
                    or get_secret("AZURE_OPENAI_API_KEY")
                    or get_secret("AZURE_API_KEY")
                )
    
                azure_ad_token = optional_params.get("extra_body", {}).pop(
                    "azure_ad_token", None
                ) or get_secret("AZURE_AD_TOKEN")
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.AzureOpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > azure_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                response = azure_text_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    api_version=api_version,
                    api_type=api_type,
                    azure_ad_token=azure_ad_token,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,
                    client=client,  # pass AsyncAzureOpenAI, AzureOpenAI client
                )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={
                            "headers": headers,
                            "api_version": api_version,
                            "api_base": api_base,
                        },
                    )
            elif (
                model in litellm.open_ai_chat_completion_models
                or custom_llm_provider == "custom_openai"
                or custom_llm_provider == "deepinfra"
                or custom_llm_provider == "perplexity"
                or custom_llm_provider == "groq"
                or custom_llm_provider == "deepseek"
                or custom_llm_provider == "anyscale"
                or custom_llm_provider == "mistral"
                or custom_llm_provider == "openai"
                or custom_llm_provider == "together_ai"
                or custom_llm_provider in litellm.openai_compatible_providers
                or "ft:gpt-3.5-turbo" in model  # finetune gpt-3.5-turbo
            ):  # allow user to make an openai call with a custom base
                # note: if a user sets a custom base - we should ensure this works
                # allow for the setting of dynamic and stateful api-bases
                api_base = (
                    api_base  # for deepinfra/perplexity/anyscale/groq we check in get_llm_provider and pass in the api base from there
                    or litellm.api_base
                    or get_secret("OPENAI_API_BASE")
                    or "https://api.openai.com/v1"
                )
                openai.organization = (
                    organization
                    or litellm.organization
                    or get_secret("OPENAI_ORGANIZATION")
                    or None  # default - https://github.com/openai/openai-python/blob/284c1799070c723c6a553337134148a7ab088dd8/openai/util.py#L105
                )
                # set API KEY
                api_key = (
                    api_key
                    or litellm.api_key  # for deepinfra/perplexity/anyscale we check in get_llm_provider and pass in the api key from there
                    or litellm.openai_key
                    or get_secret("OPENAI_API_KEY")
                )
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.OpenAIConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > openai_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
    
                ## COMPLETION CALL
                try:
                    response = openai_chat_completions.completion(
                        model=model,
                        messages=messages,
                        headers=headers,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        api_key=api_key,
                        api_base=api_base,
                        acompletion=acompletion,
                        logging_obj=logging,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        timeout=timeout,  # type: ignore
                        custom_prompt_dict=custom_prompt_dict,
                        client=client,  # pass AsyncOpenAI, OpenAI client
                        organization=organization,
                        custom_llm_provider=custom_llm_provider,
                    )
                except Exception as e:
                    ## LOGGING - log the original exception returned
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=str(e),
                        additional_args={"headers": headers},
                    )
                    raise e
    
                if optional_params.get("stream", False):
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                        additional_args={"headers": headers},
                    )
            elif (
                custom_llm_provider == "text-completion-openai"
                or "ft:babbage-002" in model
                or "ft:davinci-002" in model  # support for finetuned completion models
            ):
                openai.api_type = "openai"
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("OPENAI_API_BASE")
                    or "https://api.openai.com/v1"
                )
    
                openai.api_version = None
                # set API KEY
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.openai_key
                    or get_secret("OPENAI_API_KEY")
                )
    
                headers = headers or litellm.headers
    
                ## LOAD CONFIG - if set
                config = litellm.OpenAITextCompletionConfig.get_config()
                for k, v in config.items():
                    if (
                        k not in optional_params
                    ):  # completion(top_k=3) > openai_text_config(top_k=3) <- allows for dynamic variables to be passed in
                        optional_params[k] = v
                if litellm.organization:
                    openai.organization = litellm.organization
    
                if (
                    len(messages) > 0
                    and "content" in messages[0]
                    and type(messages[0]["content"]) == list
                ):
                    # text-davinci-003 can accept a string or array, if it's an array, assume the array is set in messages[0]['content']
                    # https://platform.openai.com/docs/api-reference/completions/create
                    prompt = messages[0]["content"]
                else:
                    prompt = " ".join([message["content"] for message in messages])  # type: ignore
    
                ## COMPLETION CALL
                _response = openai_text_completions.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    api_key=api_key,
                    api_base=api_base,
                    acompletion=acompletion,
                    client=client,  # pass AsyncOpenAI, OpenAI client
                    logging_obj=logging,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    timeout=timeout,  # type: ignore
                )
    
                if (
                    optional_params.get("stream", False) == False
                    and acompletion == False
                    and text_completion == False
                ):
                    # convert to chat completion response
                    _response = litellm.OpenAITextCompletionConfig().convert_to_chat_model_response_object(
                        response_object=_response, model_response_object=model_response
                    )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=_response,
                        additional_args={"headers": headers},
                    )
                response = _response
            elif (
                "replicate" in model
                or custom_llm_provider == "replicate"
                or model in litellm.replicate_models
            ):
                # Setting the relevant API KEY for replicate, replicate defaults to using os.environ.get("REPLICATE_API_TOKEN")
                replicate_key = None
                replicate_key = (
                    api_key
                    or litellm.replicate_key
                    or litellm.api_key
                    or get_secret("REPLICATE_API_KEY")
                    or get_secret("REPLICATE_API_TOKEN")
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("REPLICATE_API_BASE")
                    or "https://api.replicate.com/v1"
                )
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                model_response = replicate.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,  # for calculating input/output tokens
                    api_key=replicate_key,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    model_response = CustomStreamWrapper(model_response, model, logging_obj=logging, custom_llm_provider="replicate")  # type: ignore
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=replicate_key,
                        original_response=model_response,
                    )
    
                response = model_response
    
            elif custom_llm_provider == "anthropic":
                api_key = (
                    api_key
                    or litellm.anthropic_key
                    or litellm.api_key
                    or os.environ.get("ANTHROPIC_API_KEY")
                )
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                if (model == "claude-2") or (model == "claude-instant-1"):
                    # call anthropic /completion, only use this route for claude-2, claude-instant-1
                    api_base = (
                        api_base
                        or litellm.api_base
                        or get_secret("ANTHROPIC_API_BASE")
                        or "https://api.anthropic.com/v1/complete"
                    )
                    response = anthropic_text_completions.completion(
                        model=model,
                        messages=messages,
                        api_base=api_base,
                        acompletion=acompletion,
                        custom_prompt_dict=litellm.custom_prompt_dict,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,  # for calculating input/output tokens
                        api_key=api_key,
                        logging_obj=logging,
                        headers=headers,
                    )
                else:
                    # call /messages
                    # default route for all anthropic models
                    api_base = (
                        api_base
                        or litellm.api_base
                        or get_secret("ANTHROPIC_API_BASE")
                        or "https://api.anthropic.com/v1/messages"
                    )
                    response = anthropic_chat_completions.completion(
                        model=model,
                        messages=messages,
                        api_base=api_base,
                        acompletion=acompletion,
                        custom_prompt_dict=litellm.custom_prompt_dict,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=optional_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,  # for calculating input/output tokens
                        api_key=api_key,
                        logging_obj=logging,
                        headers=headers,
                    )
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                    )
                response = response
            elif custom_llm_provider == "nlp_cloud":
                nlp_cloud_key = (
                    api_key
                    or litellm.nlp_cloud_key
                    or get_secret("NLP_CLOUD_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("NLP_CLOUD_API_BASE")
                    or "https://api.nlpcloud.io/v1/gpu/"
                )
    
                response = nlp_cloud.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=nlp_cloud_key,
                    logging_obj=logging,
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        response,
                        model,
                        custom_llm_provider="nlp_cloud",
                        logging_obj=logging,
                    )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                    )
    
                response = response
            elif custom_llm_provider == "aleph_alpha":
                aleph_alpha_key = (
                    api_key
                    or litellm.aleph_alpha_key
                    or get_secret("ALEPH_ALPHA_API_KEY")
                    or get_secret("ALEPHALPHA_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("ALEPH_ALPHA_API_BASE")
                    or "https://api.aleph-alpha.com/complete"
                )
    
                model_response = aleph_alpha.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    default_max_tokens_to_sample=litellm.max_tokens,
                    api_key=aleph_alpha_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="aleph_alpha",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "cohere":
                cohere_key = (
                    api_key
                    or litellm.cohere_key
                    or get_secret("COHERE_API_KEY")
                    or get_secret("CO_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("COHERE_API_BASE")
                    or "https://api.cohere.ai/v1/generate"
                )
    
                model_response = cohere.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=cohere_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="cohere",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "cohere_chat":
                cohere_key = (
                    api_key
                    or litellm.cohere_key
                    or get_secret("COHERE_API_KEY")
                    or get_secret("CO_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("COHERE_API_BASE")
                    or "https://api.cohere.ai/v1/chat"
                )
    
                model_response = cohere_chat.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=cohere_key,
                    logging_obj=logging,  # model call logging done inside the class as we make need to modify I/O to fit aleph alpha's requirements
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="cohere_chat",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "maritalk":
                maritalk_key = (
                    api_key
                    or litellm.maritalk_key
                    or get_secret("MARITALK_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("MARITALK_API_BASE")
                    or "https://chat.maritaca.ai/api/chat/inference"
                )
    
                model_response = maritalk.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=maritalk_key,
                    logging_obj=logging,
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="maritalk",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "huggingface":
                custom_llm_provider = "huggingface"
                huggingface_key = (
                    api_key
                    or litellm.huggingface_key
                    or os.environ.get("HF_TOKEN")
                    or os.environ.get("HUGGINGFACE_API_KEY")
                    or litellm.api_key
                )
                hf_headers = headers or litellm.headers
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                model_response = huggingface.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,  # type: ignore
                    headers=hf_headers,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=huggingface_key,
                    acompletion=acompletion,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                    timeout=timeout,  # type: ignore
                )
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion is False
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="huggingface",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "oobabooga":
                custom_llm_provider = "oobabooga"
                model_response = oobabooga.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    api_base=api_base,  # type: ignore
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    api_key=None,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                )
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="oobabooga",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "openrouter":
                api_base = api_base or litellm.api_base or "https://openrouter.ai/api/v1"
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.openrouter_key
                    or get_secret("OPENROUTER_API_KEY")
                    or get_secret("OR_API_KEY")
                )
    
                openrouter_site_url = get_secret("OR_SITE_URL") or "https://litellm.ai"
    
                openrouter_app_name = get_secret("OR_APP_NAME") or "liteLLM"
    
                headers = (
                    headers
                    or litellm.headers
                    or {
                        "HTTP-Referer": openrouter_site_url,
                        "X-Title": openrouter_app_name,
                    }
                )
    
                ## Load Config
                config = openrouter.OpenrouterConfig.get_config()
                for k, v in config.items():
                    if k == "extra_body":
                        # we use openai 'extra_body' to pass openrouter specific params - transforms, route, models
                        if "extra_body" in optional_params:
                            optional_params[k].update(v)
                        else:
                            optional_params[k] = v
                    elif k not in optional_params:
                        optional_params[k] = v
    
                data = {"model": model, "messages": messages, **optional_params}
    
                ## COMPLETION CALL
                response = openai_chat_completions.completion(
                    model=model,
                    messages=messages,
                    headers=headers,
                    api_key=api_key,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    logging_obj=logging,
                    acompletion=acompletion,
                    timeout=timeout,  # type: ignore
                )
                ## LOGGING
                logging.post_call(
                    input=messages, api_key=openai.api_key, original_response=response
                )
            elif (
                custom_llm_provider == "together_ai"
                or ("togethercomputer" in model)
                or (model in litellm.together_ai_models)
            ):
                """
                Deprecated. We now do together ai calls via the openai client - https://docs.together.ai/docs/openai-api-compatibility
                """
                custom_llm_provider = "together_ai"
                together_ai_key = (
                    api_key
                    or litellm.togetherai_api_key
                    or get_secret("TOGETHER_AI_TOKEN")
                    or get_secret("TOGETHERAI_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("TOGETHERAI_API_BASE")
                    or "https://api.together.xyz/inference"
                )
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
    
                model_response = together_ai.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=together_ai_key,
                    logging_obj=logging,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if (
                    "stream_tokens" in optional_params
                    and optional_params["stream_tokens"] == True
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="together_ai",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "palm":
                palm_api_key = api_key or get_secret("PALM_API_KEY") or litellm.api_key
    
                # palm does not support streaming as yet :(
                model_response = palm.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=palm_api_key,
                    logging_obj=logging,
                )
                # fake palm streaming
                if "stream" in optional_params and optional_params["stream"] == True:
                    # fake streaming for palm
                    resp_string = model_response["choices"][0]["message"]["content"]
                    response = CustomStreamWrapper(
                        resp_string, model, custom_llm_provider="palm", logging_obj=logging
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "gemini":
                gemini_api_key = (
                    api_key
                    or get_secret("GEMINI_API_KEY")
                    or get_secret("PALM_API_KEY")  # older palm api key should also work
                    or litellm.api_key
                )
    
                # palm does not support streaming as yet :(
                model_response = gemini.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=gemini_api_key,
                    logging_obj=logging,
                    acompletion=acompletion,
                    custom_prompt_dict=custom_prompt_dict,
                )
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion == False
                ):
                    response = CustomStreamWrapper(
                        iter(model_response),
                        model,
                        custom_llm_provider="gemini",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "vertex_ai":
                vertex_ai_project = (
                    optional_params.pop("vertex_project", None)
                    or optional_params.pop("vertex_ai_project", None)
                    or litellm.vertex_project
                    or get_secret("VERTEXAI_PROJECT")
                )
                vertex_ai_location = (
                    optional_params.pop("vertex_location", None)
                    or optional_params.pop("vertex_ai_location", None)
                    or litellm.vertex_location
                    or get_secret("VERTEXAI_LOCATION")
                )
                vertex_credentials = (
                    optional_params.pop("vertex_credentials", None)
                    or optional_params.pop("vertex_ai_credentials", None)
                    or get_secret("VERTEXAI_CREDENTIALS")
                )
                new_params = deepcopy(optional_params)
                if "claude-3" in model:
                    model_response = vertex_ai_anthropic.completion(
                        model=model,
                        messages=messages,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=new_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,
                        vertex_location=vertex_ai_location,
                        vertex_project=vertex_ai_project,
                        vertex_credentials=vertex_credentials,
                        logging_obj=logging,
                        acompletion=acompletion,
                    )
                else:
                    model_response = vertex_ai.completion(
                        model=model,
                        messages=messages,
                        model_response=model_response,
                        print_verbose=print_verbose,
                        optional_params=new_params,
                        litellm_params=litellm_params,
                        logger_fn=logger_fn,
                        encoding=encoding,
                        vertex_location=vertex_ai_location,
                        vertex_project=vertex_ai_project,
                        vertex_credentials=vertex_credentials,
                        logging_obj=logging,
                        acompletion=acompletion,
                    )
    
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion == False
                ):
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="vertex_ai",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "predibase":
                tenant_id = (
                    optional_params.pop("tenant_id", None)
                    or optional_params.pop("predibase_tenant_id", None)
                    or litellm.predibase_tenant_id
                    or get_secret("PREDIBASE_TENANT_ID")
                )
    
                api_base = (
                    optional_params.pop("api_base", None)
                    or optional_params.pop("base_url", None)
                    or litellm.api_base
                    or get_secret("PREDIBASE_API_BASE")
                )
    
                api_key = (
                    api_key
                    or litellm.api_key
                    or litellm.predibase_key
                    or get_secret("PREDIBASE_API_KEY")
                )
    
                model_response = predibase_chat_completions.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                    acompletion=acompletion,
                    api_base=api_base,
                    custom_prompt_dict=custom_prompt_dict,
                    api_key=api_key,
                    tenant_id=tenant_id,
                )
    
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and acompletion == False
                ):
                    return response
                response = model_response
            elif custom_llm_provider == "ai21":
                custom_llm_provider = "ai21"
                ai21_key = (
                    api_key
                    or litellm.ai21_key
                    or os.environ.get("AI21_API_KEY")
                    or litellm.api_key
                )
    
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("AI21_API_BASE")
                    or "https://api.ai21.com/studio/v1/"
                )
    
                model_response = ai21.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=ai21_key,
                    logging_obj=logging,
                )
    
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="ai21",
                        logging_obj=logging,
                    )
                    return response
    
                ## RESPONSE OBJECT
                response = model_response
            elif custom_llm_provider == "sagemaker":
                # boto3 reads keys from .env
                model_response = sagemaker.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    custom_prompt_dict=custom_prompt_dict,
                    hf_model_name=hf_model_name,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                    acompletion=acompletion,
                )
                if (
                    "stream" in optional_params and optional_params["stream"] == True
                ):  ## [BETA]
                    print_verbose(f"ENTERS SAGEMAKER CUSTOMSTREAMWRAPPER")
                    from .llms.sagemaker import TokenIterator
    
                    tokenIterator = TokenIterator(model_response, acompletion=acompletion)
                    response = CustomStreamWrapper(
                        completion_stream=tokenIterator,
                        model=model,
                        custom_llm_provider="sagemaker",
                        logging_obj=logging,
                    )
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=None,
                        original_response=response,
                    )
                    return response
    
                ## RESPONSE OBJECT
                response = model_response
            elif custom_llm_provider == "bedrock":
                # boto3 reads keys from .env
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                response = bedrock.completion(
                    model=model,
                    messages=messages,
                    custom_prompt_dict=litellm.custom_prompt_dict,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                    extra_headers=extra_headers,
                    timeout=timeout,
                )
    
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and not isinstance(response, CustomStreamWrapper)
                ):
                    # don't try to access stream object,
                    if "ai21" in model:
                        response = CustomStreamWrapper(
                            response,
                            model,
                            custom_llm_provider="bedrock",
                            logging_obj=logging,
                        )
                    else:
                        response = CustomStreamWrapper(
                            iter(response),
                            model,
                            custom_llm_provider="bedrock",
                            logging_obj=logging,
                        )
    
                if optional_params.get("stream", False):
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=None,
                        original_response=response,
                    )
    
                ## RESPONSE OBJECT
                response = response
            elif custom_llm_provider == "watsonx":
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                response = watsonx.IBMWatsonXAI().completion(
                    model=model,
                    messages=messages,
                    custom_prompt_dict=custom_prompt_dict,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,  # type: ignore
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                    timeout=timeout,  # type: ignore
                )
                if (
                    "stream" in optional_params
                    and optional_params["stream"] == True
                    and not isinstance(response, CustomStreamWrapper)
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        iter(response),
                        model,
                        custom_llm_provider="watsonx",
                        logging_obj=logging,
                    )
    
                if optional_params.get("stream", False):
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=None,
                        original_response=response,
                    )
                ## RESPONSE OBJECT
                response = response
            elif custom_llm_provider == "vllm":
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                model_response = vllm.completion(
                    model=model,
                    messages=messages,
                    custom_prompt_dict=custom_prompt_dict,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                )
    
                if (
                    "stream" in optional_params and optional_params["stream"] == True
                ):  ## [BETA]
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="vllm",
                        logging_obj=logging,
                    )
                    return response
    
                ## RESPONSE OBJECT
                response = model_response
            elif custom_llm_provider == "ollama":
                api_base = (
                    litellm.api_base
                    or api_base
                    or get_secret("OLLAMA_API_BASE")
                    or "http://localhost:11434"
                )
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                if model in custom_prompt_dict:
                    # check if the model has a registered custom prompt
                    model_prompt_details = custom_prompt_dict[model]
                    prompt = custom_prompt(
                        role_dict=model_prompt_details["roles"],
                        initial_prompt_value=model_prompt_details["initial_prompt_value"],
                        final_prompt_value=model_prompt_details["final_prompt_value"],
                        messages=messages,
                    )
                else:
                    prompt = prompt_factory(
                        model=model,
                        messages=messages,
                        custom_llm_provider=custom_llm_provider,
                    )
                    if isinstance(prompt, dict):
                        # for multimode models - ollama/llava prompt_factory returns a dict {
                        #     "prompt": prompt,
                        #     "images": images
                        # }
                        prompt, images = prompt["prompt"], prompt["images"]
                        optional_params["images"] = images
    
                ## LOGGING
                generator = ollama.get_ollama_response(
                    api_base,
                    model,
                    prompt,
                    optional_params,
                    logging_obj=logging,
                    acompletion=acompletion,
                    model_response=model_response,
                    encoding=encoding,
                )
                if acompletion is True or optional_params.get("stream", False) == True:
                    return generator
    
                response = generator
            elif custom_llm_provider == "ollama_chat":
                api_base = (
                    litellm.api_base
                    or api_base
                    or get_secret("OLLAMA_API_BASE")
                    or "http://localhost:11434"
                )
    
                api_key = (
                    api_key
                    or litellm.ollama_key
                    or os.environ.get("OLLAMA_API_KEY")
                    or litellm.api_key
                )
                ## LOGGING
                generator = ollama_chat.get_ollama_response(
                    api_base,
                    api_key,
                    model,
                    messages,
                    optional_params,
                    logging_obj=logging,
                    acompletion=acompletion,
                    model_response=model_response,
                    encoding=encoding,
                )
                if acompletion is True or optional_params.get("stream", False) == True:
                    return generator
    
                response = generator
            elif custom_llm_provider == "cloudflare":
                api_key = (
                    api_key
                    or litellm.cloudflare_api_key
                    or litellm.api_key
                    or get_secret("CLOUDFLARE_API_KEY")
                )
                account_id = get_secret("CLOUDFLARE_ACCOUNT_ID")
                api_base = (
                    api_base
                    or litellm.api_base
                    or get_secret("CLOUDFLARE_API_BASE")
                    or f"https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/"
                )
    
                custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
                response = cloudflare.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    custom_prompt_dict=litellm.custom_prompt_dict,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,  # for calculating input/output tokens
                    api_key=api_key,
                    logging_obj=logging,
                )
                if "stream" in optional_params and optional_params["stream"] == True:
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        response,
                        model,
                        custom_llm_provider="cloudflare",
                        logging_obj=logging,
                    )
    
                if optional_params.get("stream", False) or acompletion == True:
                    ## LOGGING
                    logging.post_call(
                        input=messages,
                        api_key=api_key,
                        original_response=response,
                    )
                response = response
            elif (
                custom_llm_provider == "baseten"
                or litellm.api_base == "https://app.baseten.co"
            ):
                custom_llm_provider = "baseten"
                baseten_key = (
                    api_key
                    or litellm.baseten_key
                    or os.environ.get("BASETEN_API_KEY")
                    or litellm.api_key
                )
    
                model_response = baseten.completion(
                    model=model,
                    messages=messages,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    api_key=baseten_key,
                    logging_obj=logging,
                )
                if inspect.isgenerator(model_response) or (
                    "stream" in optional_params and optional_params["stream"] == True
                ):
                    # don't try to access stream object,
                    response = CustomStreamWrapper(
                        model_response,
                        model,
                        custom_llm_provider="baseten",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "petals" or model in litellm.petals_models:
                api_base = api_base or litellm.api_base
    
                custom_llm_provider = "petals"
                stream = optional_params.pop("stream", False)
                model_response = petals.completion(
                    model=model,
                    messages=messages,
                    api_base=api_base,
                    model_response=model_response,
                    print_verbose=print_verbose,
                    optional_params=optional_params,
                    litellm_params=litellm_params,
                    logger_fn=logger_fn,
                    encoding=encoding,
                    logging_obj=logging,
                )
                if stream == True:  ## [BETA]
                    # Fake streaming for petals
                    resp_string = model_response["choices"][0]["message"]["content"]
                    response = CustomStreamWrapper(
                        resp_string,
                        model,
                        custom_llm_provider="petals",
                        logging_obj=logging,
                    )
                    return response
                response = model_response
            elif custom_llm_provider == "custom":
                import requests
    
                url = litellm.api_base or api_base or ""
                if url == None or url == "":
                    raise ValueError(
                        "api_base not set. Set api_base or litellm.api_base for custom endpoints"
                    )
    
                """
                assume input to custom LLM api bases follow this format:
                resp = requests.post(
                    api_base,
                    json={
                        'model': 'meta-llama/Llama-2-13b-hf', # model name
                        'params': {
                            'prompt': ["The capital of France is P"],
                            'max_tokens': 32,
                            'temperature': 0.7,
                            'top_p': 1.0,
                            'top_k': 40,
                        }
                    }
                )
    
                """
                prompt = " ".join([message["content"] for message in messages])  # type: ignore
                resp = requests.post(
                    url,
                    json={
                        "model": model,
                        "params": {
                            "prompt": [prompt],
                            "max_tokens": max_tokens,
                            "temperature": temperature,
                            "top_p": top_p,
                            "top_k": kwargs.get("top_k", 40),
                        },
                    },
                )
                response_json = resp.json()
                """
                assume all responses from custom api_bases of this format:
                {
                    'data': [
                        {
                            'prompt': 'The capital of France is P',
                            'output': ['The capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France is PARIS.\nThe capital of France'],
                            'params': {'temperature': 0.7, 'top_k': 40, 'top_p': 1}}],
                            'message': 'ok'
                        }
                    ]
                }
                """
                string_response = response_json["data"][0]["output"][0]
                ## RESPONSE OBJECT
                model_response["choices"][0]["message"]["content"] = string_response
                model_response["created"] = int(time.time())
                model_response["model"] = model
                response = model_response
            else:
                raise ValueError(
                    f"Unable to map your input to a model. Check your input - {args}"
                )
            return response
        except Exception as e:
            ## Map to OpenAI Exception
>           raise exception_type(
                model=model,
                custom_llm_provider=custom_llm_provider,
                original_exception=e,
                completion_kwargs=args,
                extra_kwargs=kwargs,
            )

../main.py:2287: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

model = 'llama-3-8b-instruct', original_exception = KeyError('stream')
custom_llm_provider = 'predibase'
completion_kwargs = {'acompletion': False, 'api_base': None, 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'api_version': None, ...}
extra_kwargs = {'api_base': 'https://serving.app.predibase.com', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, 'tenant_id': 'c4768f95'}

    def exception_type(
        model,
        original_exception,
        custom_llm_provider,
        completion_kwargs={},
        extra_kwargs={},
    ):
        global user_logger_fn, liteDebuggerClient
        exception_mapping_worked = False
        if litellm.suppress_debug_info is False:
            print()  # noqa
            print(  # noqa
                "\033[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new\033[0m"  # noqa
            )  # noqa
            print(  # noqa
                "LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'."  # noqa
            )  # noqa
            print()  # noqa
        try:
            if model:
                error_str = str(original_exception)
                if isinstance(original_exception, BaseException):
                    exception_type = type(original_exception).__name__
                else:
                    exception_type = ""
    
                ################################################################################
                # Common Extra information needed for all providers
                # We pass num retries, api_base, vertex_deployment etc to the exception here
                ################################################################################
    
                _api_base = litellm.get_api_base(model=model, optional_params=extra_kwargs)
                messages = litellm.get_first_chars_messages(kwargs=completion_kwargs)
                _vertex_project = extra_kwargs.get("vertex_project")
                _vertex_location = extra_kwargs.get("vertex_location")
                _metadata = extra_kwargs.get("metadata", {}) or {}
                _model_group = _metadata.get("model_group")
                _deployment = _metadata.get("deployment")
                extra_information = f"\nModel: {model}"
                if _api_base:
                    extra_information += f"\nAPI Base: {_api_base}"
                if messages and len(messages) > 0:
                    extra_information += f"\nMessages: {messages}"
    
                if _model_group is not None:
                    extra_information += f"\nmodel_group: {_model_group}\n"
                if _deployment is not None:
                    extra_information += f"\ndeployment: {_deployment}\n"
                if _vertex_project is not None:
                    extra_information += f"\nvertex_project: {_vertex_project}\n"
                if _vertex_location is not None:
                    extra_information += f"\nvertex_location: {_vertex_location}\n"
    
                # on litellm proxy add key name + team to exceptions
                extra_information = _add_key_name_and_team_to_alert(
                    request_info=extra_information, metadata=_metadata
                )
    
                ################################################################################
                # End of Common Extra information Needed for all providers
                ################################################################################
    
                ################################################################################
                #################### Start of Provider Exception mapping ####################
                ################################################################################
    
                if "Request Timeout Error" in error_str or "Request timed out" in error_str:
                    exception_mapping_worked = True
                    raise Timeout(
                        message=f"APITimeoutError - Request timed out. {extra_information} \n error_str: {error_str}",
                        model=model,
                        llm_provider=custom_llm_provider,
                    )
    
                if (
                    custom_llm_provider == "openai"
                    or custom_llm_provider == "text-completion-openai"
                    or custom_llm_provider == "custom_openai"
                    or custom_llm_provider in litellm.openai_compatible_providers
                ):
                    # custom_llm_provider is openai, make it OpenAI
                    if hasattr(original_exception, "message"):
                        message = original_exception.message
                    else:
                        message = str(original_exception)
                    if message is not None and isinstance(message, str):
                        message = message.replace("OPENAI", custom_llm_provider.upper())
                        message = message.replace("openai", custom_llm_provider)
                        message = message.replace("OpenAI", custom_llm_provider)
                    if custom_llm_provider == "openai":
                        exception_provider = "OpenAI" + "Exception"
                    else:
                        exception_provider = (
                            custom_llm_provider[0].upper()
                            + custom_llm_provider[1:]
                            + "Exception"
                        )
    
                    if (
                        "This model's maximum context length is" in error_str
                        or "Request too large" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "model_not_found" in error_str
                    ):
                        exception_mapping_worked = True
                        raise NotFoundError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "content_policy_violation" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContentPolicyViolationError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "Incorrect API key provided" not in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Mistral API raised a streaming error" in error_str:
                        exception_mapping_worked = True
                        _request = httpx.Request(
                            method="POST", url="https://api.openai.com/v1"
                        )
                        raise APIError(
                            status_code=500,
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            request=_request,
                        )
                    elif hasattr(original_exception, "status_code"):
                        exception_mapping_worked = True
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                llm_provider=custom_llm_provider,
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 404:
                            exception_mapping_worked = True
                            raise NotFoundError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                            )
                        elif original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 504:  # gateway timeout error
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"{exception_provider} - {message} {extra_information}",
                                llm_provider=custom_llm_provider,
                                model=model,
                                request=original_exception.request,
                            )
                    else:
                        # if no status code then it is an APIConnectionError: https://github.com/openai/openai-python#handling-errors
                        raise APIConnectionError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            request=httpx.Request(
                                method="POST", url="https://api.openai.com/v1/"
                            ),
                        )
                elif custom_llm_provider == "anthropic":  # one of the anthropics
                    if hasattr(original_exception, "message"):
                        if (
                            "prompt is too long" in original_exception.message
                            or "prompt: length" in original_exception.message
                        ):
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=original_exception.message,
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                        if "Invalid API Key" in original_exception.message:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=original_exception.message,
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                    if hasattr(original_exception, "status_code"):
                        print_verbose(f"status_code: {original_exception.status_code}")
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AnthropicException - {original_exception.message}",
                                llm_provider="anthropic",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 400
                            or original_exception.status_code == 413
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AnthropicException - {original_exception.message}",
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AnthropicException - {original_exception.message}",
                                model=model,
                                llm_provider="anthropic",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AnthropicException - {original_exception.message}",
                                llm_provider="anthropic",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=500,
                                message=f"AnthropicException - {original_exception.message}. Handle with `litellm.APIError`.",
                                llm_provider="anthropic",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "replicate":
                    if "Incorrect authentication token" in error_str:
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"ReplicateException - {error_str}",
                            llm_provider="replicate",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "input is too long" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"ReplicateException - {error_str}",
                            model=model,
                            llm_provider="replicate",
                            response=original_exception.response,
                        )
                    elif exception_type == "ModelError":
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"ReplicateException - {error_str}",
                            model=model,
                            llm_provider="replicate",
                            response=original_exception.response,
                        )
                    elif "Request was throttled" in error_str:
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"ReplicateException - {error_str}",
                            llm_provider="replicate",
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 400
                            or original_exception.status_code == 422
                            or original_exception.status_code == 413
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"ReplicateException - {original_exception.message}",
                                model=model,
                                llm_provider="replicate",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"ReplicateException - {original_exception.message}",
                                model=model,
                                llm_provider="replicate",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                    exception_mapping_worked = True
                    raise APIError(
                        status_code=500,
                        message=f"ReplicateException - {str(original_exception)}",
                        llm_provider="replicate",
                        model=model,
                        request=original_exception.request,
                    )
                elif custom_llm_provider == "bedrock":
                    if (
                        "too many tokens" in error_str
                        or "expected maxLength:" in error_str
                        or "Input is too long" in error_str
                        or "prompt: length: 1.." in error_str
                        or "Too many input tokens" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"BedrockException: Context Window Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "Malformed input request" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"BedrockException - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if (
                        "Unable to locate credentials" in error_str
                        or "The security token included in the request is invalid"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"BedrockException Invalid Authentication - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "AccessDeniedException" in error_str:
                        exception_mapping_worked = True
                        raise PermissionDeniedError(
                            message=f"BedrockException PermissionDeniedError - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if (
                        "throttlingException" in error_str
                        or "ThrottlingException" in error_str
                    ):
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"BedrockException: Rate Limit Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "Connect timeout on endpoint URL" in error_str:
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"BedrockException: Timeout Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=httpx.Response(
                                    status_code=500,
                                    request=httpx.Request(
                                        method="POST", url="https://api.openai.com/v1/"
                                    ),
                                ),
                            )
                        elif original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 404:
                            exception_mapping_worked = True
                            raise NotFoundError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                elif custom_llm_provider == "sagemaker":
                    if "Unable to locate credentials" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"SagemakerException - {error_str}",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                    elif (
                        "Input validation error: `best_of` must be > 0 and <= 2"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"SagemakerException - the value of 'n' must be > 0 and <= 2 for sagemaker endpoints",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                    elif (
                        "`inputs` tokens + `max_new_tokens` must be <=" in error_str
                        or "instance type with more CPU capacity or memory" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"SagemakerException - {error_str}",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                elif custom_llm_provider == "vertex_ai":
                    if (
                        "Vertex AI API has not been used in project" in error_str
                        or "Unable to find your project" in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "None Unknown Error." in error_str
                        or "Content has no parts." in error_str
                    ):
                        exception_mapping_worked = True
                        raise APIError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            status_code=500,
                            model=model,
                            llm_provider="vertex_ai",
                            request=original_exception.request,
                        )
                    elif "403" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=original_exception.response,
                        )
                    elif "The response was blocked." in error_str:
                        exception_mapping_worked = True
                        raise UnprocessableEntityError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    elif (
                        "429 Quota exceeded" in error_str
                        or "IndexError: list index out of range" in error_str
                        or "429 Unable to submit request because the service is temporarily out of capacity."
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"VertexAIException - {error_str} {extra_information}",
                                model=model,
                                llm_provider="vertex_ai",
                                response=original_exception.response,
                            )
                        if original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise APIError(
                                message=f"VertexAIException - {error_str} {extra_information}",
                                status_code=500,
                                model=model,
                                llm_provider="vertex_ai",
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "palm" or custom_llm_provider == "gemini":
                    if "503 Getting metadata" in error_str:
                        # auth errors look like this
                        # 503 Getting metadata from plugin failed with error: Reauthentication is needed. Please run `gcloud auth application-default login` to reauthenticate.
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"GeminiException - Invalid api key",
                            model=model,
                            llm_provider="palm",
                            response=original_exception.response,
                        )
                    if (
                        "504 Deadline expired before operation could complete." in error_str
                        or "504 Deadline Exceeded" in error_str
                    ):
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"GeminiException - {original_exception.message}",
                            model=model,
                            llm_provider="palm",
                        )
                    if "400 Request payload size exceeds" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"GeminiException - {error_str}",
                            model=model,
                            llm_provider="palm",
                            response=original_exception.response,
                        )
                    if (
                        "500 An internal error has occurred." in error_str
                        or "list index out of range" in error_str
                    ):
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=getattr(original_exception, "status_code", 500),
                            message=f"GeminiException - {original_exception.message}",
                            llm_provider="palm",
                            model=model,
                            request=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"GeminiException - {error_str}",
                                model=model,
                                llm_provider="palm",
                                response=original_exception.response,
                            )
                    # Dailed: Error occurred: 400 Request payload size exceeds the limit: 20000 bytes
                elif custom_llm_provider == "cloudflare":
                    if "Authentication error" in error_str:
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"Cloudflare Exception - {original_exception.message}",
                            llm_provider="cloudflare",
                            model=model,
                            response=original_exception.response,
                        )
                    if "must have required property" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"Cloudflare Exception - {original_exception.message}",
                            llm_provider="cloudflare",
                            model=model,
                            response=original_exception.response,
                        )
                elif (
                    custom_llm_provider == "cohere" or custom_llm_provider == "cohere_chat"
                ):  # Cohere
                    if (
                        "invalid api token" in error_str
                        or "No API key provided." in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "too many tokens" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"CohereException - {original_exception.message}",
                            model=model,
                            llm_provider="cohere",
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        if (
                            original_exception.status_code == 400
                            or original_exception.status_code == 498
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                response=original_exception.response,
                            )
                    elif (
                        "CohereConnectionError" in exception_type
                    ):  # cohere seems to fire these errors when we load test it (1k+ messages / min)
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "invalid type:" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Unexpected server error" in error_str:
                        exception_mapping_worked = True
                        raise ServiceUnavailableError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    else:
                        if hasattr(original_exception, "status_code"):
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                request=original_exception.request,
                            )
                        raise original_exception
                elif custom_llm_provider == "huggingface":
                    if "length limit exceeded" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=error_str,
                            model=model,
                            llm_provider="huggingface",
                            response=original_exception.response,
                        )
                    elif "A valid user token is required" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=error_str,
                            llm_provider="huggingface",
                            model=model,
                            response=original_exception.response,
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                model=model,
                                llm_provider="huggingface",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"HuggingfaceException - {original_exception.message}",
                                model=model,
                                llm_provider="huggingface",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "ai21":
                    if hasattr(original_exception, "message"):
                        if "Prompt has too many tokens" in original_exception.message:
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                        if "Bad or missing API token." in original_exception.message:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                            )
                        if original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "nlp_cloud":
                    if "detail" in error_str:
                        if "Input text length should not exceed" in error_str:
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        elif "value is not a valid" in error_str:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=500,
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                request=original_exception.request,
                            )
                    if hasattr(
                        original_exception, "status_code"
                    ):  # https://docs.nlpcloud.com/?shell#errors
                        if (
                            original_exception.status_code == 400
                            or original_exception.status_code == 406
                            or original_exception.status_code == 413
                            or original_exception.status_code == 422
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 401
                            or original_exception.status_code == 403
                        ):
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 522
                            or original_exception.status_code == 524
                        ):
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"NLPCloudException - {original_exception.message}",
                                model=model,
                                llm_provider="nlp_cloud",
                            )
                        elif (
                            original_exception.status_code == 429
                            or original_exception.status_code == 402
                        ):
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 500
                            or original_exception.status_code == 503
                        ):
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                request=original_exception.request,
                            )
                        elif (
                            original_exception.status_code == 504
                            or original_exception.status_code == 520
                        ):
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"NLPCloudException - {original_exception.message}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "together_ai":
                    import json
    
                    try:
                        error_response = json.loads(error_str)
                    except:
                        error_response = {"error": error_str}
                    if (
                        "error" in error_response
                        and "`inputs` tokens + `max_new_tokens` must be <="
                        in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "error" in error_response
                        and "invalid private key" in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"TogetherAIException - {error_response['error']}",
                            llm_provider="together_ai",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "error" in error_response
                        and "INVALID_ARGUMENT" in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
    
                    elif (
                        "error" in error_response
                        and "API key doesn't match expected format."
                        in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "error_type" in error_response
                        and error_response["error_type"] == "validation"
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"TogetherAIException - {original_exception.message}",
                                model=model,
                                llm_provider="together_ai",
                            )
                        elif original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"TogetherAIException - {error_response['error']}",
                                model=model,
                                llm_provider="together_ai",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"TogetherAIException - {original_exception.message}",
                                llm_provider="together_ai",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 524:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"TogetherAIException - {original_exception.message}",
                                llm_provider="together_ai",
                                model=model,
                            )
                    else:
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=original_exception.status_code,
                            message=f"TogetherAIException - {original_exception.message}",
                            llm_provider="together_ai",
                            model=model,
                            request=original_exception.request,
                        )
                elif custom_llm_provider == "aleph_alpha":
                    if (
                        "This is longer than the model's maximum context length"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"AlephAlphaException - {original_exception.message}",
                            llm_provider="aleph_alpha",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "InvalidToken" in error_str or "No token provided" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"AlephAlphaException - {original_exception.message}",
                            llm_provider="aleph_alpha",
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        print_verbose(f"status code: {original_exception.status_code}")
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        raise original_exception
                    raise original_exception
                elif (
                    custom_llm_provider == "ollama" or custom_llm_provider == "ollama_chat"
                ):
                    if isinstance(original_exception, dict):
                        error_str = original_exception.get("error", "")
                    else:
                        error_str = str(original_exception)
                    if "no such file or directory" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"OllamaException: Invalid Model/Model not loaded - {original_exception}",
                            model=model,
                            llm_provider="ollama",
                            response=original_exception.response,
                        )
                    elif "Failed to establish a new connection" in error_str:
                        exception_mapping_worked = True
                        raise ServiceUnavailableError(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Invalid response object from API" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Read timed out" in error_str:
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                        )
                elif custom_llm_provider == "vllm":
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 0:
                            exception_mapping_worked = True
                            raise APIConnectionError(
                                message=f"VLLMException - {original_exception.message}",
                                llm_provider="vllm",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "azure":
                    if "Internal server error" in error_str:
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=500,
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            request=httpx.Request(method="POST", url="https://openai.com/"),
                        )
                    elif "This model's maximum context length is" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "DeploymentNotFound" in error_str:
                        exception_mapping_worked = True
                        raise NotFoundError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "content_policy_violation" in error_str
                    ) or (
                        "The response was filtered due to the prompt triggering Azure OpenAI's content management"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContentPolicyViolationError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "invalid_request_error" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "The api_key client option must be set either by passing api_key to the client or by setting"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"{exception_provider} - {original_exception.message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        exception_mapping_worked = True
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                llm_provider="azure",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                            )
                        if original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 504:  # gateway timeout error
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                llm_provider="azure",
                                model=model,
                                request=httpx.Request(
                                    method="POST", url="https://openai.com/"
                                ),
                            )
                    else:
                        # if no status code then it is an APIConnectionError: https://github.com/openai/openai-python#handling-errors
                        raise APIConnectionError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            request=httpx.Request(method="POST", url="https://openai.com/"),
                        )
            if (
                "BadRequestError.__init__() missing 1 required positional argument: 'param'"
                in str(original_exception)
            ):  # deal with edge-case invalid request error bug in openai-python sdk
                exception_mapping_worked = True
                raise BadRequestError(
                    message=f"{exception_provider}: This can happen due to missing AZURE_API_VERSION: {str(original_exception)}",
                    model=model,
                    llm_provider=custom_llm_provider,
                    response=original_exception.response,
                )
            else:  # ensure generic errors always return APIConnectionError=
                exception_mapping_worked = True
                if hasattr(original_exception, "request"):
                    raise APIConnectionError(
                        message=f"{str(original_exception)}",
                        llm_provider=custom_llm_provider,
                        model=model,
                        request=original_exception.request,
                    )
                else:
                    raise APIConnectionError(
                        message=f"{str(original_exception)}",
                        llm_provider=custom_llm_provider,
                        model=model,
                        request=httpx.Request(
                            method="POST", url="https://api.openai.com/v1/"
                        ),  # stub the request
                    )
        except Exception as e:
            # LOGGING
            exception_logging(
                logger_fn=user_logger_fn,
                additional_args={
                    "exception_mapping_worked": exception_mapping_worked,
                    "original_exception": original_exception,
                },
                exception=e,
            )
            ## AUTH ERROR
            if isinstance(e, AuthenticationError) and (
                litellm.email or "LITELLM_EMAIL" in os.environ
            ):
                threading.Thread(target=get_all_keys, args=(e.llm_provider,)).start()
            # don't let an error with mapping interrupt the user from receiving an error from the llm api calls
            if exception_mapping_worked:
>               raise e

../utils.py:9353: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

model = 'llama-3-8b-instruct', original_exception = KeyError('stream')
custom_llm_provider = 'predibase'
completion_kwargs = {'acompletion': False, 'api_base': None, 'api_key': 'pb_Qg9YbQo7UqqHdu0ozxN_aw', 'api_version': None, ...}
extra_kwargs = {'api_base': 'https://serving.app.predibase.com', 'litellm_call_id': 'cf0ea464-1b45-4473-8e55-6bf6809df7a7', 'litellm_logging_obj': <litellm.utils.Logging object at 0x10a5cbf50>, 'tenant_id': 'c4768f95'}

    def exception_type(
        model,
        original_exception,
        custom_llm_provider,
        completion_kwargs={},
        extra_kwargs={},
    ):
        global user_logger_fn, liteDebuggerClient
        exception_mapping_worked = False
        if litellm.suppress_debug_info is False:
            print()  # noqa
            print(  # noqa
                "\033[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new\033[0m"  # noqa
            )  # noqa
            print(  # noqa
                "LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'."  # noqa
            )  # noqa
            print()  # noqa
        try:
            if model:
                error_str = str(original_exception)
                if isinstance(original_exception, BaseException):
                    exception_type = type(original_exception).__name__
                else:
                    exception_type = ""
    
                ################################################################################
                # Common Extra information needed for all providers
                # We pass num retries, api_base, vertex_deployment etc to the exception here
                ################################################################################
    
                _api_base = litellm.get_api_base(model=model, optional_params=extra_kwargs)
                messages = litellm.get_first_chars_messages(kwargs=completion_kwargs)
                _vertex_project = extra_kwargs.get("vertex_project")
                _vertex_location = extra_kwargs.get("vertex_location")
                _metadata = extra_kwargs.get("metadata", {}) or {}
                _model_group = _metadata.get("model_group")
                _deployment = _metadata.get("deployment")
                extra_information = f"\nModel: {model}"
                if _api_base:
                    extra_information += f"\nAPI Base: {_api_base}"
                if messages and len(messages) > 0:
                    extra_information += f"\nMessages: {messages}"
    
                if _model_group is not None:
                    extra_information += f"\nmodel_group: {_model_group}\n"
                if _deployment is not None:
                    extra_information += f"\ndeployment: {_deployment}\n"
                if _vertex_project is not None:
                    extra_information += f"\nvertex_project: {_vertex_project}\n"
                if _vertex_location is not None:
                    extra_information += f"\nvertex_location: {_vertex_location}\n"
    
                # on litellm proxy add key name + team to exceptions
                extra_information = _add_key_name_and_team_to_alert(
                    request_info=extra_information, metadata=_metadata
                )
    
                ################################################################################
                # End of Common Extra information Needed for all providers
                ################################################################################
    
                ################################################################################
                #################### Start of Provider Exception mapping ####################
                ################################################################################
    
                if "Request Timeout Error" in error_str or "Request timed out" in error_str:
                    exception_mapping_worked = True
                    raise Timeout(
                        message=f"APITimeoutError - Request timed out. {extra_information} \n error_str: {error_str}",
                        model=model,
                        llm_provider=custom_llm_provider,
                    )
    
                if (
                    custom_llm_provider == "openai"
                    or custom_llm_provider == "text-completion-openai"
                    or custom_llm_provider == "custom_openai"
                    or custom_llm_provider in litellm.openai_compatible_providers
                ):
                    # custom_llm_provider is openai, make it OpenAI
                    if hasattr(original_exception, "message"):
                        message = original_exception.message
                    else:
                        message = str(original_exception)
                    if message is not None and isinstance(message, str):
                        message = message.replace("OPENAI", custom_llm_provider.upper())
                        message = message.replace("openai", custom_llm_provider)
                        message = message.replace("OpenAI", custom_llm_provider)
                    if custom_llm_provider == "openai":
                        exception_provider = "OpenAI" + "Exception"
                    else:
                        exception_provider = (
                            custom_llm_provider[0].upper()
                            + custom_llm_provider[1:]
                            + "Exception"
                        )
    
                    if (
                        "This model's maximum context length is" in error_str
                        or "Request too large" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "model_not_found" in error_str
                    ):
                        exception_mapping_worked = True
                        raise NotFoundError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "content_policy_violation" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContentPolicyViolationError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "Incorrect API key provided" not in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Mistral API raised a streaming error" in error_str:
                        exception_mapping_worked = True
                        _request = httpx.Request(
                            method="POST", url="https://api.openai.com/v1"
                        )
                        raise APIError(
                            status_code=500,
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            request=_request,
                        )
                    elif hasattr(original_exception, "status_code"):
                        exception_mapping_worked = True
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                llm_provider=custom_llm_provider,
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 404:
                            exception_mapping_worked = True
                            raise NotFoundError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                            )
                        elif original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 504:  # gateway timeout error
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"{exception_provider} - {message} {extra_information}",
                                model=model,
                                llm_provider=custom_llm_provider,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"{exception_provider} - {message} {extra_information}",
                                llm_provider=custom_llm_provider,
                                model=model,
                                request=original_exception.request,
                            )
                    else:
                        # if no status code then it is an APIConnectionError: https://github.com/openai/openai-python#handling-errors
                        raise APIConnectionError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            request=httpx.Request(
                                method="POST", url="https://api.openai.com/v1/"
                            ),
                        )
                elif custom_llm_provider == "anthropic":  # one of the anthropics
                    if hasattr(original_exception, "message"):
                        if (
                            "prompt is too long" in original_exception.message
                            or "prompt: length" in original_exception.message
                        ):
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=original_exception.message,
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                        if "Invalid API Key" in original_exception.message:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=original_exception.message,
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                    if hasattr(original_exception, "status_code"):
                        print_verbose(f"status_code: {original_exception.status_code}")
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AnthropicException - {original_exception.message}",
                                llm_provider="anthropic",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 400
                            or original_exception.status_code == 413
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AnthropicException - {original_exception.message}",
                                model=model,
                                llm_provider="anthropic",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AnthropicException - {original_exception.message}",
                                model=model,
                                llm_provider="anthropic",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AnthropicException - {original_exception.message}",
                                llm_provider="anthropic",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=500,
                                message=f"AnthropicException - {original_exception.message}. Handle with `litellm.APIError`.",
                                llm_provider="anthropic",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "replicate":
                    if "Incorrect authentication token" in error_str:
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"ReplicateException - {error_str}",
                            llm_provider="replicate",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "input is too long" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"ReplicateException - {error_str}",
                            model=model,
                            llm_provider="replicate",
                            response=original_exception.response,
                        )
                    elif exception_type == "ModelError":
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"ReplicateException - {error_str}",
                            model=model,
                            llm_provider="replicate",
                            response=original_exception.response,
                        )
                    elif "Request was throttled" in error_str:
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"ReplicateException - {error_str}",
                            llm_provider="replicate",
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 400
                            or original_exception.status_code == 422
                            or original_exception.status_code == 413
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"ReplicateException - {original_exception.message}",
                                model=model,
                                llm_provider="replicate",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"ReplicateException - {original_exception.message}",
                                model=model,
                                llm_provider="replicate",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"ReplicateException - {original_exception.message}",
                                llm_provider="replicate",
                                model=model,
                                response=original_exception.response,
                            )
                    exception_mapping_worked = True
                    raise APIError(
                        status_code=500,
                        message=f"ReplicateException - {str(original_exception)}",
                        llm_provider="replicate",
                        model=model,
                        request=original_exception.request,
                    )
                elif custom_llm_provider == "bedrock":
                    if (
                        "too many tokens" in error_str
                        or "expected maxLength:" in error_str
                        or "Input is too long" in error_str
                        or "prompt: length: 1.." in error_str
                        or "Too many input tokens" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"BedrockException: Context Window Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "Malformed input request" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"BedrockException - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if (
                        "Unable to locate credentials" in error_str
                        or "The security token included in the request is invalid"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"BedrockException Invalid Authentication - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "AccessDeniedException" in error_str:
                        exception_mapping_worked = True
                        raise PermissionDeniedError(
                            message=f"BedrockException PermissionDeniedError - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if (
                        "throttlingException" in error_str
                        or "ThrottlingException" in error_str
                    ):
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"BedrockException: Rate Limit Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                            response=original_exception.response,
                        )
                    if "Connect timeout on endpoint URL" in error_str:
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"BedrockException: Timeout Error - {error_str}",
                            model=model,
                            llm_provider="bedrock",
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=httpx.Response(
                                    status_code=500,
                                    request=httpx.Request(
                                        method="POST", url="https://api.openai.com/v1/"
                                    ),
                                ),
                            )
                        elif original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 404:
                            exception_mapping_worked = True
                            raise NotFoundError(
                                message=f"BedrockException - {original_exception.message}",
                                llm_provider="bedrock",
                                model=model,
                                response=original_exception.response,
                            )
                elif custom_llm_provider == "sagemaker":
                    if "Unable to locate credentials" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"SagemakerException - {error_str}",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                    elif (
                        "Input validation error: `best_of` must be > 0 and <= 2"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"SagemakerException - the value of 'n' must be > 0 and <= 2 for sagemaker endpoints",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                    elif (
                        "`inputs` tokens + `max_new_tokens` must be <=" in error_str
                        or "instance type with more CPU capacity or memory" in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"SagemakerException - {error_str}",
                            model=model,
                            llm_provider="sagemaker",
                            response=original_exception.response,
                        )
                elif custom_llm_provider == "vertex_ai":
                    if (
                        "Vertex AI API has not been used in project" in error_str
                        or "Unable to find your project" in error_str
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "None Unknown Error." in error_str
                        or "Content has no parts." in error_str
                    ):
                        exception_mapping_worked = True
                        raise APIError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            status_code=500,
                            model=model,
                            llm_provider="vertex_ai",
                            request=original_exception.request,
                        )
                    elif "403" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=original_exception.response,
                        )
                    elif "The response was blocked." in error_str:
                        exception_mapping_worked = True
                        raise UnprocessableEntityError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    elif (
                        "429 Quota exceeded" in error_str
                        or "IndexError: list index out of range" in error_str
                        or "429 Unable to submit request because the service is temporarily out of capacity."
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"VertexAIException - {error_str} {extra_information}",
                            model=model,
                            llm_provider="vertex_ai",
                            response=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"VertexAIException - {error_str} {extra_information}",
                                model=model,
                                llm_provider="vertex_ai",
                                response=original_exception.response,
                            )
                        if original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise APIError(
                                message=f"VertexAIException - {error_str} {extra_information}",
                                status_code=500,
                                model=model,
                                llm_provider="vertex_ai",
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "palm" or custom_llm_provider == "gemini":
                    if "503 Getting metadata" in error_str:
                        # auth errors look like this
                        # 503 Getting metadata from plugin failed with error: Reauthentication is needed. Please run `gcloud auth application-default login` to reauthenticate.
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"GeminiException - Invalid api key",
                            model=model,
                            llm_provider="palm",
                            response=original_exception.response,
                        )
                    if (
                        "504 Deadline expired before operation could complete." in error_str
                        or "504 Deadline Exceeded" in error_str
                    ):
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"GeminiException - {original_exception.message}",
                            model=model,
                            llm_provider="palm",
                        )
                    if "400 Request payload size exceeds" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"GeminiException - {error_str}",
                            model=model,
                            llm_provider="palm",
                            response=original_exception.response,
                        )
                    if (
                        "500 An internal error has occurred." in error_str
                        or "list index out of range" in error_str
                    ):
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=getattr(original_exception, "status_code", 500),
                            message=f"GeminiException - {original_exception.message}",
                            llm_provider="palm",
                            model=model,
                            request=httpx.Response(
                                status_code=429,
                                request=httpx.Request(
                                    method="POST",
                                    url=" https://cloud.google.com/vertex-ai/",
                                ),
                            ),
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"GeminiException - {error_str}",
                                model=model,
                                llm_provider="palm",
                                response=original_exception.response,
                            )
                    # Dailed: Error occurred: 400 Request payload size exceeds the limit: 20000 bytes
                elif custom_llm_provider == "cloudflare":
                    if "Authentication error" in error_str:
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"Cloudflare Exception - {original_exception.message}",
                            llm_provider="cloudflare",
                            model=model,
                            response=original_exception.response,
                        )
                    if "must have required property" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"Cloudflare Exception - {original_exception.message}",
                            llm_provider="cloudflare",
                            model=model,
                            response=original_exception.response,
                        )
                elif (
                    custom_llm_provider == "cohere" or custom_llm_provider == "cohere_chat"
                ):  # Cohere
                    if (
                        "invalid api token" in error_str
                        or "No API key provided." in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "too many tokens" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"CohereException - {original_exception.message}",
                            model=model,
                            llm_provider="cohere",
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        if (
                            original_exception.status_code == 400
                            or original_exception.status_code == 498
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                response=original_exception.response,
                            )
                    elif (
                        "CohereConnectionError" in exception_type
                    ):  # cohere seems to fire these errors when we load test it (1k+ messages / min)
                        exception_mapping_worked = True
                        raise RateLimitError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "invalid type:" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Unexpected server error" in error_str:
                        exception_mapping_worked = True
                        raise ServiceUnavailableError(
                            message=f"CohereException - {original_exception.message}",
                            llm_provider="cohere",
                            model=model,
                            response=original_exception.response,
                        )
                    else:
                        if hasattr(original_exception, "status_code"):
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"CohereException - {original_exception.message}",
                                llm_provider="cohere",
                                model=model,
                                request=original_exception.request,
                            )
                        raise original_exception
                elif custom_llm_provider == "huggingface":
                    if "length limit exceeded" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=error_str,
                            model=model,
                            llm_provider="huggingface",
                            response=original_exception.response,
                        )
                    elif "A valid user token is required" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=error_str,
                            llm_provider="huggingface",
                            model=model,
                            response=original_exception.response,
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                model=model,
                                llm_provider="huggingface",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"HuggingfaceException - {original_exception.message}",
                                model=model,
                                llm_provider="huggingface",
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"HuggingfaceException - {original_exception.message}",
                                llm_provider="huggingface",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "ai21":
                    if hasattr(original_exception, "message"):
                        if "Prompt has too many tokens" in original_exception.message:
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                        if "Bad or missing API token." in original_exception.message:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                            )
                        if original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AI21Exception - {original_exception.message}",
                                model=model,
                                llm_provider="ai21",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"AI21Exception - {original_exception.message}",
                                llm_provider="ai21",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "nlp_cloud":
                    if "detail" in error_str:
                        if "Input text length should not exceed" in error_str:
                            exception_mapping_worked = True
                            raise ContextWindowExceededError(
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        elif "value is not a valid" in error_str:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=500,
                                message=f"NLPCloudException - {error_str}",
                                model=model,
                                llm_provider="nlp_cloud",
                                request=original_exception.request,
                            )
                    if hasattr(
                        original_exception, "status_code"
                    ):  # https://docs.nlpcloud.com/?shell#errors
                        if (
                            original_exception.status_code == 400
                            or original_exception.status_code == 406
                            or original_exception.status_code == 413
                            or original_exception.status_code == 422
                        ):
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 401
                            or original_exception.status_code == 403
                        ):
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 522
                            or original_exception.status_code == 524
                        ):
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"NLPCloudException - {original_exception.message}",
                                model=model,
                                llm_provider="nlp_cloud",
                            )
                        elif (
                            original_exception.status_code == 429
                            or original_exception.status_code == 402
                        ):
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                response=original_exception.response,
                            )
                        elif (
                            original_exception.status_code == 500
                            or original_exception.status_code == 503
                        ):
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                request=original_exception.request,
                            )
                        elif (
                            original_exception.status_code == 504
                            or original_exception.status_code == 520
                        ):
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"NLPCloudException - {original_exception.message}",
                                model=model,
                                llm_provider="nlp_cloud",
                                response=original_exception.response,
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"NLPCloudException - {original_exception.message}",
                                llm_provider="nlp_cloud",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "together_ai":
                    import json
    
                    try:
                        error_response = json.loads(error_str)
                    except:
                        error_response = {"error": error_str}
                    if (
                        "error" in error_response
                        and "`inputs` tokens + `max_new_tokens` must be <="
                        in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "error" in error_response
                        and "invalid private key" in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"TogetherAIException - {error_response['error']}",
                            llm_provider="together_ai",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "error" in error_response
                        and "INVALID_ARGUMENT" in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
    
                    elif (
                        "error" in error_response
                        and "API key doesn't match expected format."
                        in error_response["error"]
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    elif (
                        "error_type" in error_response
                        and error_response["error_type"] == "validation"
                    ):
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"TogetherAIException - {error_response['error']}",
                            model=model,
                            llm_provider="together_ai",
                            response=original_exception.response,
                        )
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"TogetherAIException - {original_exception.message}",
                                model=model,
                                llm_provider="together_ai",
                            )
                        elif original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"TogetherAIException - {error_response['error']}",
                                model=model,
                                llm_provider="together_ai",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"TogetherAIException - {original_exception.message}",
                                llm_provider="together_ai",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 524:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"TogetherAIException - {original_exception.message}",
                                llm_provider="together_ai",
                                model=model,
                            )
                    else:
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=original_exception.status_code,
                            message=f"TogetherAIException - {original_exception.message}",
                            llm_provider="together_ai",
                            model=model,
                            request=original_exception.request,
                        )
                elif custom_llm_provider == "aleph_alpha":
                    if (
                        "This is longer than the model's maximum context length"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"AlephAlphaException - {original_exception.message}",
                            llm_provider="aleph_alpha",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "InvalidToken" in error_str or "No token provided" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"AlephAlphaException - {original_exception.message}",
                            llm_provider="aleph_alpha",
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        print_verbose(f"status code: {original_exception.status_code}")
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                            )
                        elif original_exception.status_code == 400:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 500:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"AlephAlphaException - {original_exception.message}",
                                llm_provider="aleph_alpha",
                                model=model,
                                response=original_exception.response,
                            )
                        raise original_exception
                    raise original_exception
                elif (
                    custom_llm_provider == "ollama" or custom_llm_provider == "ollama_chat"
                ):
                    if isinstance(original_exception, dict):
                        error_str = original_exception.get("error", "")
                    else:
                        error_str = str(original_exception)
                    if "no such file or directory" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"OllamaException: Invalid Model/Model not loaded - {original_exception}",
                            model=model,
                            llm_provider="ollama",
                            response=original_exception.response,
                        )
                    elif "Failed to establish a new connection" in error_str:
                        exception_mapping_worked = True
                        raise ServiceUnavailableError(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Invalid response object from API" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "Read timed out" in error_str:
                        exception_mapping_worked = True
                        raise Timeout(
                            message=f"OllamaException: {original_exception}",
                            llm_provider="ollama",
                            model=model,
                        )
                elif custom_llm_provider == "vllm":
                    if hasattr(original_exception, "status_code"):
                        if original_exception.status_code == 0:
                            exception_mapping_worked = True
                            raise APIConnectionError(
                                message=f"VLLMException - {original_exception.message}",
                                llm_provider="vllm",
                                model=model,
                                request=original_exception.request,
                            )
                elif custom_llm_provider == "azure":
                    if "Internal server error" in error_str:
                        exception_mapping_worked = True
                        raise APIError(
                            status_code=500,
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            request=httpx.Request(method="POST", url="https://openai.com/"),
                        )
                    elif "This model's maximum context length is" in error_str:
                        exception_mapping_worked = True
                        raise ContextWindowExceededError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "DeploymentNotFound" in error_str:
                        exception_mapping_worked = True
                        raise NotFoundError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "invalid_request_error" in error_str
                        and "content_policy_violation" in error_str
                    ) or (
                        "The response was filtered due to the prompt triggering Azure OpenAI's content management"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise ContentPolicyViolationError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif "invalid_request_error" in error_str:
                        exception_mapping_worked = True
                        raise BadRequestError(
                            message=f"AzureException - {original_exception.message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            response=original_exception.response,
                        )
                    elif (
                        "The api_key client option must be set either by passing api_key to the client or by setting"
                        in error_str
                    ):
                        exception_mapping_worked = True
                        raise AuthenticationError(
                            message=f"{exception_provider} - {original_exception.message} {extra_information}",
                            llm_provider=custom_llm_provider,
                            model=model,
                            response=original_exception.response,
                        )
                    elif hasattr(original_exception, "status_code"):
                        exception_mapping_worked = True
                        if original_exception.status_code == 401:
                            exception_mapping_worked = True
                            raise AuthenticationError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                llm_provider="azure",
                                model=model,
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 408:
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                            )
                        if original_exception.status_code == 422:
                            exception_mapping_worked = True
                            raise BadRequestError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 429:
                            exception_mapping_worked = True
                            raise RateLimitError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 503:
                            exception_mapping_worked = True
                            raise ServiceUnavailableError(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                                response=original_exception.response,
                            )
                        elif original_exception.status_code == 504:  # gateway timeout error
                            exception_mapping_worked = True
                            raise Timeout(
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                model=model,
                                llm_provider="azure",
                            )
                        else:
                            exception_mapping_worked = True
                            raise APIError(
                                status_code=original_exception.status_code,
                                message=f"AzureException - {original_exception.message} {extra_information}",
                                llm_provider="azure",
                                model=model,
                                request=httpx.Request(
                                    method="POST", url="https://openai.com/"
                                ),
                            )
                    else:
                        # if no status code then it is an APIConnectionError: https://github.com/openai/openai-python#handling-errors
                        raise APIConnectionError(
                            message=f"{exception_provider} - {message} {extra_information}",
                            llm_provider="azure",
                            model=model,
                            request=httpx.Request(method="POST", url="https://openai.com/"),
                        )
            if (
                "BadRequestError.__init__() missing 1 required positional argument: 'param'"
                in str(original_exception)
            ):  # deal with edge-case invalid request error bug in openai-python sdk
                exception_mapping_worked = True
                raise BadRequestError(
                    message=f"{exception_provider}: This can happen due to missing AZURE_API_VERSION: {str(original_exception)}",
                    model=model,
                    llm_provider=custom_llm_provider,
                    response=original_exception.response,
                )
            else:  # ensure generic errors always return APIConnectionError=
                exception_mapping_worked = True
                if hasattr(original_exception, "request"):
                    raise APIConnectionError(
                        message=f"{str(original_exception)}",
                        llm_provider=custom_llm_provider,
                        model=model,
                        request=original_exception.request,
                    )
                else:
>                   raise APIConnectionError(
                        message=f"{str(original_exception)}",
                        llm_provider=custom_llm_provider,
                        model=model,
                        request=httpx.Request(
                            method="POST", url="https://api.openai.com/v1/"
                        ),  # stub the request
                    )
E                   litellm.exceptions.APIConnectionError: 'stream'

../utils.py:9328: APIConnectionError

During handling of the above exception, another exception occurred:

sync_mode = True

    @pytest.mark.parametrize("sync_mode", [True, False])
    @pytest.mark.asyncio
    async def test_completion_predibase_streaming(sync_mode):
        try:
            litellm.set_verbose = True
    
            if sync_mode:
                response = completion(
                    model="predibase/llama-3-8b-instruct",
                    tenant_id="c4768f95",
                    api_base="https://serving.app.predibase.com",
                    api_key=os.getenv("PREDIBASE_API_KEY"),
                    messages=[{"role": "user", "content": "What is the meaning of life?"}],
                    stream=True,
                )
    
                complete_response = ""
                for idx, init_chunk in enumerate(response):
                    chunk, finished = streaming_format_tests(idx, init_chunk)
                    complete_response += chunk
                    custom_llm_provider = init_chunk._hidden_params["custom_llm_provider"]
                    print(f"custom_llm_provider: {custom_llm_provider}")
                    assert custom_llm_provider == "predibase"
                    if finished:
                        assert isinstance(
                            init_chunk.choices[0], litellm.utils.StreamingChoices
                        )
                        break
                if complete_response.strip() == "":
                    raise Exception("Empty response received")
            else:
                response = await litellm.acompletion(
                    model="predibase/llama-3-8b-instruct",
                    tenant_id="c4768f95",
                    api_base="https://serving.app.predibase.com",
                    api_key=os.getenv("PREDIBASE_API_KEY"),
                    messages=[{"role": "user", "content": "What is the meaning of life?"}],
                    stream=True,
                )
    
                # await response
    
                complete_response = ""
                idx = 0
                async for init_chunk in response:
                    chunk, finished = streaming_format_tests(idx, init_chunk)
                    complete_response += chunk
                    custom_llm_provider = init_chunk._hidden_params["custom_llm_provider"]
                    print(f"custom_llm_provider: {custom_llm_provider}")
                    assert custom_llm_provider == "predibase"
                    idx += 1
                    if finished:
                        assert isinstance(
                            init_chunk.choices[0], litellm.utils.StreamingChoices
                        )
                        break
                if complete_response.strip() == "":
                    raise Exception("Empty response received")
    
            print(f"complete_response: {complete_response}")
        except litellm.Timeout as e:
            pass
        except Exception as e:
>           pytest.fail(f"Error occurred: {e}")
E           Failed: Error occurred: 'stream'

test_streaming.py:373: Failed
---------------------------- Captured stdout setup -----------------------------
<module 'litellm' from '/Users/krrishdholakia/Documents/litellm/litellm/__init__.py'>
----------------------------- Captured stdout call -----------------------------


[92mRequest to litellm:[0m
[92mlitellm.completion(model='predibase/llama-3-8b-instruct', tenant_id='c4768f95', api_base='https://serving.app.predibase.com', api_key='pb_Qg9YbQo7UqqHdu0ozxN_aw', messages=[{'role': 'user', 'content': 'What is the meaning of life?'}], stream=True)[0m


self.optional_params: {}
SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
UNMAPPED PROVIDER, ASSUMING IT'S OPENAI/AZURE - model=llama-3-8b-instruct, custom_llm_provider=predibase
Final returned optional params: {'stream': True, 'tenant_id': 'c4768f95'}
self.optional_params: {'stream': True, 'tenant_id': 'c4768f95'}
[92m

POST Request Sent from LiteLLM:
curl -X POST \
https://serving.app.predibase.com/c4768f95/deployments/v2/llms/llama-3-8b-instruct/generate_stream \
-H 'content-type: application/json' -H 'Authorization: Bearer pb_Qg********************' \
-d '{'inputs': 'What is the meaning of life?', 'parameters': {'details': True, 'max_new_tokens': 256, 'return_full_text': False}}'
[0m


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

Logging Details: logger_fn - None | callable(logger_fn) - False
Logging Details LiteLLM-Failure Call
self.failure_callback: []
=============================== warnings summary ===============================
../../../../../../opt/homebrew/lib/python3.11/site-packages/pydantic/_internal/_config.py:284: 25 warnings
  /opt/homebrew/lib/python3.11/site-packages/pydantic/_internal/_config.py:284: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    warnings.warn(DEPRECATION_MESSAGE, DeprecationWarning)

../proxy/_types.py:219
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:219: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:306
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:306: PydanticDeprecatedSince20: `pydantic.config.Extra` is deprecated, use literal values instead (e.g. `extra='allow'`). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    extra = Extra.allow  # Allow extra fields

../proxy/_types.py:309
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:309: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:338
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:338: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:385
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:385: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:454
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:454: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:474
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:474: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:487
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:487: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:532
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:532: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:569
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:569: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:864
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:864: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:891
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:891: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../proxy/_types.py:912
  /Users/krrishdholakia/Documents/litellm/litellm/proxy/_types.py:912: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
    @root_validator(pre=True)

../utils.py:39
  /Users/krrishdholakia/Documents/litellm/litellm/utils.py:39: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources  # type: ignore

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: 10 warnings
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.cloud')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2317
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2317
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2317
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2317: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(parent)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.iam')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
../../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('zope')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_streaming.py::test_completion_predibase_streaming[False]
  /opt/homebrew/lib/python3.11/site-packages/httpx/_content.py:204: DeprecationWarning: Use 'content=<...>' to upload raw bytes/text content.
    warnings.warn(message, DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED test_streaming.py::test_completion_predibase_streaming[True] - Failed:...
=================== 1 failed, 1 passed, 64 warnings in 5.28s ===================
