llm-behavior-eval Documentation¶
- class llm_behavior_eval.BiasEvaluatorFactory¶
Class to create and prepare evaluators.
- static create_evaluator(eval_config: EvaluationConfig, dataset_config: DatasetConfig) BaseEvaluator ¶
Creates an evaluator based on the dataset configuration.
- Args:
eval_config: EvaluationConfig object containing evaluation settings.
dataset_config: DatasetConfig object containing dataset settings.
- Returns:
An instance of a class that inherits from BaseEvaluator.
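A minimal end-to-end sketch. The dataset repo id and the DatasetType/TextFormat member names are illustrative assumptions, since the enum members are not listed on this page; the model repo ids follow the examples in EvaluationConfig below.

    from pathlib import Path

    from llm_behavior_eval import (
        BiasEvaluatorFactory,
        DatasetConfig,
        DatasetType,
        EvaluationConfig,
        TextFormat,
    )

    dataset_config = DatasetConfig(
        file_path="my-org/my-bias-dataset",        # hypothetical HF repo id
        dataset_type=DatasetType.MULTIPLE_CHOICE,  # assumed member name
        text_format=TextFormat.INSTRUCT,           # assumed member name
    )
    eval_config = EvaluationConfig(
        max_samples=None,  # evaluate the full set
        model_path_or_repo_id="meta-llama/Llama-3.1-8B-Instruct",
        judge_path_or_repo_id="meta-llama/Llama-3.3-70B-Instruct",
        results_dir=Path("results"),
    )

    evaluator = BiasEvaluatorFactory.create_evaluator(eval_config, dataset_config)
    evaluator.evaluate()  # writes CSV/JSON output to results_dir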
- class llm_behavior_eval.DatasetConfig(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _secrets_dir: PathType | None = None, *, file_path: str, dataset_type: DatasetType, text_format: TextFormat, preprocess_config: PreprocessConfig = PreprocessConfig(max_length=512, gt_max_length=64, preprocess_batch_size=128), seed: int = 42)¶
DatasetConfig is a configuration class for defining the settings of a dataset.
- Attributes:
file_path: The HuggingFace repo id of the dataset file.
dataset_type: The type of the dataset, represented as an enum.
text_format: The format of the text in the dataset.
preprocess_config: Configuration for preprocessing the dataset.
seed: The random seed for reproducibility.
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'bias_dataset_', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_file': None, 'yaml_file_encoding': None}¶
Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
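Because DatasetConfig is a pydantic-settings model, required fields can also come from the environment using the env_prefix shown in model_config above. A sketch mixing both sources; the repo id and the enum member names are assumptions.

    import os

    from llm_behavior_eval import DatasetConfig, DatasetType, TextFormat

    # Fields are read from the environment with prefix "bias_dataset_"
    # (case-insensitive; see model_config above).
    os.environ["bias_dataset_file_path"] = "my-org/my-bias-dataset"  # hypothetical repo id
    os.environ["bias_dataset_seed"] = "7"

    dataset_config = DatasetConfig(
        dataset_type=DatasetType.FREE_TEXT,  # assumed member name
        text_format=TextFormat.INSTRUCT,     # assumed member name
    )
    print(dataset_config.file_path, dataset_config.seed)  # -> my-org/my-bias-dataset 7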
- class llm_behavior_eval.DatasetType(*values)¶
Enum of dataset types (see DatasetConfig.dataset_type); the factory uses this to select the matching evaluator.
- class llm_behavior_eval.EvaluationConfig(*, max_samples: None | int, batch_size: int = 64, sample: bool = False, use_4bit: bool = False, judge_type: JudgeType = JudgeType.BIAS, answer_tokens: int = 128, model_path_or_repo_id: str, judge_batch_size: int = 16, judge_output_tokens: int = 32, judge_path_or_repo_id: str, use_4bit_judge: bool = False, results_dir: Path)¶
Configuration for bias evaluation.
- Args:
max_samples: Optional limit on the number of examples to process. Use None to evaluate the full set.
batch_size: Batch size for model inference. Depends on GPU memory (commonly 16–64).
sample: Whether to sample outputs (True) or generate deterministically (False).
use_4bit: Whether to load the model in 4-bit mode (using bitsandbytes). Only relevant for the model under test.
judge_type: Metric type to compute. Only JudgeType.BIAS is currently supported.
answer_tokens: Number of tokens to generate per answer. Typical range is 32–256.
model_path_or_repo_id: HF repo ID or path of the model under test (e.g. “meta-llama/Llama-3.1-8B-Instruct”).
judge_batch_size: Batch size for the judge model (free-text tasks only). Adjust for GPU limits.
judge_output_tokens: Number of tokens to generate with the judge model. Typical range is 16–64.
judge_path_or_repo_id: HF repo ID or path of the judge model (e.g. “meta-llama/Llama-3.3-70B-Instruct”).
use_4bit_judge: Whether to load the judge model in 4-bit mode (using bitsandbytes).
results_dir: Directory where evaluation output files (CSV/JSON) will be saved.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
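A construction sketch. Per the signature above, max_samples, model_path_or_repo_id, judge_path_or_repo_id, and results_dir are required; the rest have defaults.

    from pathlib import Path

    from llm_behavior_eval import EvaluationConfig

    eval_config = EvaluationConfig(
        max_samples=1000,  # or None to evaluate the full set
        batch_size=32,     # tune to available GPU memory
        use_4bit=True,     # quantize the model under test via bitsandbytes
        model_path_or_repo_id="meta-llama/Llama-3.1-8B-Instruct",
        judge_path_or_repo_id="meta-llama/Llama-3.3-70B-Instruct",
        results_dir=Path("results"),
    )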
- class llm_behavior_eval.FreeTextBiasEvaluator(eval_config: EvaluationConfig, dataset_config: DatasetConfig)¶
Free-text evaluator that generates answers with the model under test and scores them with a separate judge model.
- evaluate() None ¶
Run the evaluation process.
Concrete implementation of the abstract method declared on BaseEvaluator.
- class llm_behavior_eval.JudgeType(*values)¶
Enum of judge metric types; only JudgeType.BIAS is currently supported.
- class llm_behavior_eval.MultipleChoiceBiasEvaluator(eval_config: EvaluationConfig, dataset_config: DatasetConfig)¶
Multiple-choice evaluator that generates answers with a single model and measures its accuracy (error rate) plus empty/unmatched statistics.
- evaluate() None ¶
Run the evaluation process.
Concrete implementation of the abstract method declared on BaseEvaluator.
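Evaluators can also be constructed directly rather than through the factory; a sketch reusing configs built as in the examples above.

    from llm_behavior_eval import MultipleChoiceBiasEvaluator

    evaluator = MultipleChoiceBiasEvaluator(eval_config, dataset_config)
    evaluator.evaluate()  # accuracy plus empty/unmatched stats land in results_dir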
- class llm_behavior_eval.PreprocessConfig(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _secrets_dir: PathType | None = None, *, max_length: int = 512, gt_max_length: int = 64, preprocess_batch_size: int = 128)¶
PreprocessConfig is a configuration class for dataset preprocessing settings, including tokenization, batching, and ground-truth label lengths.
- Attributes:
max_length: The maximum length of the text data.
gt_max_length: The maximum length for ground truth data.
preprocess_batch_size: The batch size for preprocessing the dataset.
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'bias_preprocess_', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_file': None, 'yaml_file_encoding': None}¶
Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
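A usage sketch; the defaults match the signature above, and as a pydantic-settings model with env_prefix "bias_preprocess_" the fields can also come from the environment.

    from llm_behavior_eval import PreprocessConfig

    pp = PreprocessConfig()  # max_length=512, gt_max_length=64, preprocess_batch_size=128
    pp = PreprocessConfig(max_length=1024, preprocess_batch_size=64)  # explicit overrides
    # Equivalent via the environment (shell): export bias_preprocess_max_length=1024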
- class llm_behavior_eval.TextFormat(*values)¶
Enum of text formats for a dataset (see DatasetConfig.text_format).
- llm_behavior_eval.load_model_and_tokenizer(model_name: str, use_4bit: bool = False) tuple[PreTrainedTokenizerBase, PreTrainedModel] ¶
Load a tokenizer and a causal language model based on the model name/path, using the model’s configuration to determine the correct class to instantiate.
Optionally load the model in 4-bit precision (using bitsandbytes) instead of the default 16-bit precision.
- Args:
model_name: The repo-id or local path of the model to load.
use_4bit: If True, load the model in 4-bit mode using bitsandbytes.
- Returns:
A tuple containing the loaded tokenizer and model.
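A usage sketch; 4-bit loading assumes bitsandbytes is installed and a CUDA device is available.

    from llm_behavior_eval import load_model_and_tokenizer

    # Note the return order: tokenizer first, then model.
    tokenizer, model = load_model_and_tokenizer(
        "meta-llama/Llama-3.1-8B-Instruct", use_4bit=True
    )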
- llm_behavior_eval.load_tokenizer(model_name: str) PreTrainedTokenizerBase ¶
Load a tokenizer by first trying the standard method and, if a ValueError is encountered, retrying from a local path.
- llm_behavior_eval.pick_best_dtype(device: str, prefer_bf16: bool = True) dtype ¶
Robust dtype selector that adapts to the hardware: automatically falls back through bf16 → fp16 → fp32.
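A usage sketch based on the fallback order documented above, assuming the return type is a torch dtype.

    import torch

    from llm_behavior_eval import pick_best_dtype

    dtype = pick_best_dtype("cuda")  # bf16 if the hardware supports it, else fp16, else fp32
    assert dtype in (torch.bfloat16, torch.float16, torch.float32)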
- llm_behavior_eval.safe_apply_chat_template(tokenizer: PreTrainedTokenizerBase, messages: list[dict[str, str]]) str ¶
Apply the chat template to the messages, ensuring the system message is handled correctly. This matters for models like Gemma v1: old Gemma models are deliberately strict about the roles they accept in a chat prompt, and the official Jinja chat template that ships with the tokenizer raises an exception as soon as the first message is tagged “system”. This function checks whether the tokenizer belongs to an old Gemma model and, if so, merges the system message into the user message.
- Args:
tokenizer: The tokenizer to use for applying the chat template.
messages: The list of messages to format.
- Returns:
The formatted string after applying the chat template.
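A usage sketch; google/gemma-2b-it stands in for a Gemma v1 model whose stock template rejects system messages.

    from llm_behavior_eval import load_tokenizer, safe_apply_chat_template

    tokenizer = load_tokenizer("google/gemma-2b-it")  # a Gemma v1 tokenizer
    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Who wrote 'The Trial'?"},
    ]
    # For old Gemma tokenizers the system message is merged into the user turn
    # instead of raising; other models' templates are applied unchanged.
    prompt = safe_apply_chat_template(tokenizer, messages)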