Random Rule Forest (RRF)#

Module contents#

Random Rule Forest.

An interpretable ensemble framework for binary classification based on YES/NO questions generated by LLMs.

class QuestionExclusion(*values)#

Bases: StrEnum

EXPERT = 'expert'#

PREDICTION_SIMILARITY = 'prediction_similarity'#

SEMANTICS = 'semantics'#

class RRF(qgen_llmc, qanswer_llmc=None, qgen_temperature=0.0, qanswer_temperature=0.0, llm_semaphore_limit=3, answer_similarity_func='hamming', max_generated_questions=100, max_samples_as_context=30, class_ratio=(1.0, 1.0), q_answer_update_interval=10, save_path=None, name=None, random_state=42, use_cumulative_memory=True, qanswer_batch_size=None)#

Bases: object

Interpretable ensemble binary classifier.

Parameters:

qgen_llmc (List[Union[AnthropicChoice, GoogleChoice, OpenAIChoice, XAIChoice, AnthropicChoiceDict, GoogleChoiceDict, OpenAIChoiceDict, XAIChoiceDict]]) – LLMs to use for question generation, in priority order.
qanswer_llmc (Optional[List[Union[AnthropicChoice, GoogleChoice, OpenAIChoice, XAIChoice, AnthropicChoiceDict, GoogleChoiceDict, OpenAIChoiceDict, XAIChoiceDict]]]) – LLMs to use for answering questions, in priority order. If None, use qgen_llmc.
qgen_temperature (float) – Sampling temperature for question generation.
qanswer_temperature (float) – Sampling temperature for answering questions.
llm_semaphore_limit (int) – Max concurrent LLM calls.
answer_similarity_func (Union[str, Literal['jaccard', 'hamming']]) – Function to use for answer similarity.
max_generated_questions (int) – Maximum number of questions to generate. Max 1000.
max_samples_as_context (int) – Number of samples used as context in a round of question generation. Max 100.
class_ratio (Tuple[float, float]) – Ratio of YES to NO samples to use as context in a round of question generation.
q_answer_update_interval (int) – Logging interval of question answering.
save_path (str | PathLike[str] | None) – Directory to save checkpoints/models.
name (str | None) – Name of the forest instance.
random_state (int) – Random seed.
use_cumulative_memory (bool) – Whether to use cumulative memory when generating questions across multiple LLM calls.
qanswer_batch_size (int | None) – Maximum number of samples to answer in a single LLM call. If None or 1, batching is disabled and the original per-sample behaviour is used (one LLM call per sample). Set >1 to enable true batched answering.
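As a rough illustration of the two answer_similarity_func options, the sketch below compares the answer vectors of two questions using the textbook Hamming and Jaccard definitions. The function names are hypothetical, and the formulas RRF applies internally are not stated here, so treat this as an assumption rather than the library's implementation.

```python
from typing import Sequence


def hamming_similarity(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of samples on which two questions give the same answer."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    if not a:
        return 1.0  # convention chosen for this sketch: empty vectors agree
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard_similarity(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Samples answered YES by both, divided by samples answered YES by either."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0  # no YES anywhere: treat as identical


# True stands for a YES answer, False for a NO answer.
q1 = [True, True, False, False]
q2 = [True, False, False, False]
```

Hamming similarity rewards agreement on NO answers as well as YES answers, while Jaccard similarity only looks at samples where at least one question answered YES, so the two can rank question pairs differently on imbalanced data.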

classmethod load(dir_path)#

Load an RRF previously saved with save().

Parameters: dir_path (str | PathLike[str])
Return type: RRF

async add_question(question)#

Add a question to the RRF.

Parameters: question (str) – The question to add.
Raises: ValueError – If question already exists.
Return type: Literal[True]

filter_questions_on_pred_similarity(threshold)#

Filter questions on prediction similarity.

If two questions have a prediction similarity greater than or equal to the threshold, the question with the lower f1 score is excluded.

Parameters:

threshold (float | None) – Threshold for prediction similarity. If None, no filtering is done based on prediction similarity.

Raises:

AssertionError – If threshold is not in the range 0 < threshold <= 1.

Return type:

None
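The exclusion rule described above can be sketched as a greedy pass over questions ranked by F1 score. This is a minimal illustration under stated assumptions: the helper names, the plain agreement measure, and the greedy ordering are all hypothetical and are not taken from the RRF source.

```python
from typing import Dict, List, Set


def agreement(a: List[str], b: List[str]) -> float:
    """Share of samples on which two questions produced the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_by_prediction_similarity(
    answers: Dict[str, List[str]],
    f1: Dict[str, float],
    threshold: float,
) -> Set[str]:
    """Return the ids of questions that would be excluded."""
    assert 0 < threshold <= 1, "threshold must satisfy 0 < threshold <= 1"
    # Visit questions from highest to lowest F1 so that, within a similar
    # pair, the better-scoring question is the one that is kept.
    ranked = sorted(answers, key=lambda qid: f1[qid], reverse=True)
    kept: List[str] = []
    excluded: Set[str] = set()
    for qid in ranked:
        if any(agreement(answers[qid], answers[k]) >= threshold for k in kept):
            excluded.add(qid)
        else:
            kept.append(qid)
    return excluded


# Illustrative data only: three questions answered on four samples.
answers = {
    "q1": ["YES", "NO", "YES", "NO"],
    "q2": ["YES", "NO", "YES", "YES"],
    "q3": ["NO", "YES", "NO", "YES"],
}
f1 = {"q1": 0.8, "q2": 0.6, "q3": 0.7}
```

With a threshold of 0.75, q2 agrees with q1 on three of four samples and has the lower F1 score, so it is the only question excluded in this toy example.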

async filter_questions_on_semantics(threshold, emb_model)#

Filter questions on semantics.

If two questions have a semantic similarity greater than or equal to the threshold, the question with the lower f1 score is excluded.

Parameters:

threshold (float | None) – Threshold for semantic filtering. If None, no filtering is done based on semantic similarity.
emb_model (Union[str, Literal['hashed_bag_of_words']]) – Embedding model to use for semantic filtering.

Raises:

AssertionError – If threshold is not in the range 0 < threshold <= 1.

Return type:

None
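To give a feel for what a 'hashed_bag_of_words' embedding might look like, the sketch below hashes each token into a fixed-size count vector and compares two questions by cosine similarity. The vector size, tokenizer, hash function, and the use of cosine similarity are all assumptions made for illustration; the actual embedding used by RRF may differ.

```python
import hashlib
import math
import re
from typing import List


def hashed_bow(text: str, dim: int = 64) -> List[float]:
    """Map text to a count vector by hashing each lowercase token to a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash keeps buckets identical across runs, unlike built-in hash().
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


same = cosine(
    hashed_bow("Is the applicant employed?"),
    hashed_bow("is the applicant EMPLOYED"),
)
```

Two questions whose similarity reaches the threshold would then be treated as near-duplicates, and the one with the lower F1 score excluded.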

async fit(X=None, y=None, *, copy_data=True, reset=False)#

Fit the RRF to the data.

Parameters:

X (DataFrame | None) – Training features. Required on first run or with reset=True.
y (Optional[Sequence[str]]) – Training labels. Required on first run or with reset=True.
copy_data (bool) – Whether to copy input data.
reset (bool) – Clear existing state and restart forest generation.

Returns:

Updated RRF.

Return type:

Self

Raises:

ValueError – If data requirements aren’t met or invalid reset usage.

get_answers()#

Get answers dataframe.

Returns:

DataFrame whose columns are the question ids (excluding questions removed during semantic filtering) and whose index holds the sample indices from self._X.

Return type:

DataFrame

get_questions()#

Get generated questions dataframe.

Returns:

DataFrame containing the following columns:

question: Generated question text
embedding: Question embedding vectors
exclusion: Question exclusion principle
precision: Precision for each question
recall: Recall for each question
f1_score: F1 score for each question
accuracy: Accuracy for each question

Return type:

DataFrame
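The per-question metric columns can be understood through the standard definitions, sketched below for a single question. The sketch assumes that answers and labels are both "YES"/"NO" strings and that "YES" is the positive class; the function name is hypothetical and this is not the library's own computation.

```python
from typing import Dict, Sequence


def question_metrics(answers: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Precision, recall, F1, and accuracy of one question's answers vs. labels."""
    pairs = list(zip(answers, labels))
    tp = sum(1 for a, y in pairs if a == "YES" and y == "YES")
    fp = sum(1 for a, y in pairs if a == "YES" and y == "NO")
    fn = sum(1 for a, y in pairs if a == "NO" and y == "YES")
    tn = sum(1 for a, y in pairs if a == "NO" and y == "NO")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1, "accuracy": accuracy}


# Illustrative data only: one question answered on five labelled samples.
metrics = question_metrics(
    answers=["YES", "YES", "NO", "NO", "YES"],
    labels=["YES", "NO", "NO", "YES", "YES"],
)
```

In this toy example the question yields two true positives, one false positive, one false negative, and one true negative, so precision, recall, and F1 are all 2/3 and accuracy is 0.6.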

async predict(samples)#

Predict labels for samples.

Parameters: samples (DataFrame) – Samples to predict.
Return type: AsyncGenerator[Tuple[Any, str, Literal['YES', 'NO'], TokenCounter], None]
Returns: Async generator yielding (sample_index, question, answer, token_counter) tuples.
Raises: ValueError – If samples is empty or does not have the expected columns.
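Because predict is an async generator, its results are consumed with async for. The sketch below uses a hypothetical stand-in generator, fake_predict, that yields tuples of the same shape, and collapses the per-question answers into one label per sample by majority vote. The voting rule and the tie-break toward "NO" are assumptions for illustration; how RRF itself aggregates answers is not described here.

```python
import asyncio
from collections import Counter, defaultdict
from typing import Any, AsyncGenerator, Dict, Tuple


async def fake_predict() -> AsyncGenerator[Tuple[Any, str, str, None], None]:
    """Hypothetical stand-in for RRF.predict with the same tuple shape."""
    rows = [
        (0, "Q1", "YES"), (0, "Q2", "YES"), (0, "Q3", "NO"),
        (1, "Q1", "NO"), (1, "Q2", "NO"), (1, "Q3", "YES"),
    ]
    for sample_index, question, answer in rows:
        # The real generator yields a TokenCounter here; None is a placeholder.
        yield sample_index, question, answer, None


async def majority_vote(stream: AsyncGenerator) -> Dict[Any, str]:
    """Count YES/NO answers per sample and return the majority label."""
    votes: Dict[Any, Counter] = defaultdict(Counter)
    async for sample_index, _question, answer, _tokens in stream:
        votes[sample_index][answer] += 1
    # Ties fall back to "NO" in this sketch.
    return {idx: ("YES" if c["YES"] > c["NO"] else "NO") for idx, c in votes.items()}


if __name__ == "__main__":
    print(asyncio.run(majority_vote(fake_predict())))
```

With a real forest, the stream passed to majority_vote would be the generator returned by calling predict on a samples DataFrame.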

save(dir_path=None, for_production=False)#

Save model config to JSON and dataframes to parquet in a directory.

If dir_path is None, uses <self.save_path>/<self.name>. If for_production is True, strips the questions dataframe and does not save the answers and training dataframes.

Parameters:

dir_path (str | PathLike[str] | None) – The directory to save the RRF to.
for_production (bool) – Whether to save the RRF for production.

Return type:

None

async set_tasks(instructions_template=None, task_description=None)#

Initialize question generation instructions template.

This sets the task description for the RRF. It either uses a custom template or generates one from the task description using an LLM. For most users, LLM generation is recommended over custom templates.

Parameters:

instructions_template (str | None) – Custom template to use. Must contain ‘<number_of_questions>’ tag. If None, generates template from task_description using LLM.
task_description (str | None) – Description of classification task to help LLM generate the template.

Return type:

str

Returns:

The question generation instructions template.

Raises:

ValueError – If template missing required tag or generation fails.
AssertionError – If both parameters are None.
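The tag requirement on custom templates can be illustrated with a small check. The helper names below are hypothetical, and the substitution step is an assumption about how the tag is later filled in; only the requirement that the template contains the <number_of_questions> tag comes from the documentation above.

```python
TAG = "<number_of_questions>"


def validate_template(template: str) -> str:
    """Raise ValueError if the required tag is missing, else return the template."""
    if TAG not in template:
        raise ValueError(f"instructions template must contain the {TAG} tag")
    return template


def render_template(template: str, n_questions: int) -> str:
    """Hypothetical rendering step: replace the tag with a concrete count."""
    return validate_template(template).replace(TAG, str(n_questions))


custom = "Generate <number_of_questions> YES/NO questions that separate the two classes."
rendered = render_template(custom, 5)
```

A template written this way could then be passed as instructions_template to set_tasks instead of relying on LLM generation.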

async update_question_exclusion(question_id, exclusion)#

Update a question exclusion.

Parameters:

question_id (str) – The id of the question to update.
exclusion (QuestionExclusion | None) – The exclusion to set. If None, removes the exclusion status.

Returns:

The question that was updated.

Return type:

str

Raises:

ValueError – If question id is not found.

property llm_semaphore_limit: int#

property question_gen_instructions_template: str | None#: Get the question generation instructions template.

property task_description: str | None#: Get the task description.

property token_usage: TokenCounter#: Get the token counter for the RRF.

Type aliases#

EmbeddingModel#: alias of str | Literal['hashed_bag_of_words']

AnsSimilarityFunc#: alias of str | Literal['jaccard', 'hamming']
