Random Rule Forest (RRF)

Module contents

Random Rule Forest.

An interpretable ensemble framework for binary classification based on YES/NO questions generated by LLMs.

class QuestionExclusion(*values)

Bases: StrEnum

EXPERT = 'expert'
PREDICTION_SIMILARITY = 'prediction_similarity'
SEMANTICS = 'semantics'
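
Because QuestionExclusion derives from StrEnum, its members behave like plain strings. The sketch below is a local mirror of the documented members, not the library's source; the fallback base class for interpreters older than Python 3.11 is an assumption added for illustration.

```python
from enum import Enum

try:  # StrEnum is part of the standard library from Python 3.11 onwards
    from enum import StrEnum
except ImportError:  # assumption: minimal stand-in for older interpreters
    class StrEnum(str, Enum):
        pass


class QuestionExclusion(StrEnum):
    """Local mirror of the documented enum, for illustration only."""

    EXPERT = "expert"
    PREDICTION_SIMILARITY = "prediction_similarity"
    SEMANTICS = "semantics"


# Members compare equal to their string values ...
assert QuestionExclusion.EXPERT == "expert"
# ... and can be looked up from the raw string.
assert QuestionExclusion("semantics") is QuestionExclusion.SEMANTICS
```
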
class RRF(qgen_llmc, qanswer_llmc=None, qgen_temperature=0.0, qanswer_temperature=0.0, llm_semaphore_limit=3, answer_similarity_func='hamming', max_generated_questions=100, max_samples_as_context=30, class_ratio=(1.0, 1.0), q_answer_update_interval=10, save_path=None, name=None, random_state=42)

Bases: object

Interpretable ensemble binary classifier.

Parameters:
classmethod load(dir_path)

Load an RRF saved by save.

Return type:

RRF

Parameters:

dir_path (str | PathLike[str])

async add_question(question)

Add a question to the RRF.

Parameters:

question (str) – The question to add.

Raises:

ValueError – If question already exists.

Return type:

Literal[True]

filter_questions_on_pred_similarity(threshold)

Filter questions on prediction similarity.

If two questions have a prediction similarity greater than or equal to the threshold, the question with the lower F1 score is excluded.

Parameters:
  • threshold (float | None) – Threshold for prediction similarity. If None, no filtering is done based on prediction similarity.

Raises:

AssertionError – If threshold is not in the range 0 < threshold <= 1.

Return type:

None
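
The exclusion rule above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: it assumes answers are stored as per-question lists of 'YES'/'NO' strings, uses Hamming agreement as the similarity, and resolves pairs greedily from the highest F1 score downwards. The names hamming_similarity and filter_on_pred_similarity are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def hamming_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of samples on which two answer vectors agree."""
    if len(a) != len(b) or not a:
        raise ValueError("answer vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_on_pred_similarity(
    answers: Dict[str, Sequence[str]],
    f1_scores: Dict[str, float],
    threshold: Optional[float],
) -> List[str]:
    """Return the ids of questions that would be excluded (hypothetical helper)."""
    if threshold is None:  # documented behaviour: None disables the filter
        return []
    assert 0 < threshold <= 1, "threshold must satisfy 0 < threshold <= 1"
    # Assumption: visit questions from best to worst F1 so that the
    # stronger question of any similar pair is the one that is kept.
    ranked = sorted(answers, key=lambda qid: f1_scores[qid], reverse=True)
    kept: List[str] = []
    excluded: List[str] = []
    for qid in ranked:
        if any(hamming_similarity(answers[qid], answers[k]) >= threshold for k in kept):
            excluded.append(qid)
        else:
            kept.append(qid)
    return excluded


answers = {
    "q1": ["YES", "NO", "YES", "NO"],
    "q2": ["YES", "NO", "YES", "YES"],  # agrees with q1 on 3 of 4 samples
    "q3": ["NO", "YES", "NO", "YES"],
}
f1_scores = {"q1": 0.8, "q2": 0.6, "q3": 0.7}
print(filter_on_pred_similarity(answers, f1_scores, threshold=0.75))  # ['q2']
```
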

async filter_questions_on_semantics(threshold, emb_model)

Filter questions on semantics.

If two questions have a semantic similarity greater than or equal to the threshold, the question with the lower F1 score is excluded.

Parameters:
  • threshold (float | None) – Threshold for semantic filtering. If None, no filtering is done based on semantic similarity.

  • emb_model (Union[str, Literal['hashed_bag_of_words']]) – Embedding model to use for semantic filtering.

Raises:

AssertionError – If threshold is not in the range 0 < threshold <= 1.

Return type:

None
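
To give a feel for what a semantic similarity between two questions can look like, here is a rough sketch of a hashed bag-of-words embedding compared with cosine similarity. How the library actually implements 'hashed_bag_of_words' is not documented here, so the tokenizer, the SHA-256 bucket hashing, the vector size, and the helper names hashed_bow and cosine_similarity are all assumptions.

```python
import hashlib
import math
import re
from typing import List


def hashed_bow(text: str, dim: int = 64) -> List[float]:
    """Count lowercase alphanumeric tokens into `dim` hash buckets (assumed scheme)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


q_a = "Does the message mention a discount?"
q_b = "Does the message mention a price reduction?"
similarity = cosine_similarity(hashed_bow(q_a), hashed_bow(q_b))
# Under the documented rule, the lower-F1 question of the pair would be
# excluded whenever `similarity >= threshold`.
print(round(similarity, 3))
```
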

async fit(X=None, y=None, *, copy_data=True, reset=False)

Fit the RRF to the data.

Parameters:
  • X (DataFrame | None) – Training features. Required on first run or with reset=True.

  • y (Optional[Sequence[str]]) – Training labels. Required on first run or with reset=True.

  • copy_data (bool) – Whether to copy input data.

  • reset (bool) – Clear existing state and restart forest generation.

Returns:

Updated RRF.

Return type:

Self

Raises:

ValueError – If the data requirements are not met or reset is used incorrectly.

get_answers()

Get answers dataframe.

Returns:

DataFrame containing:

  • Columns: the ids of the questions (those not excluded during semantic filtering).

  • Index: the sample indices from self._X.

Return type:

DataFrame

get_questions()

Get generated questions dataframe.

Returns:

DataFrame containing:

  • question: Generated question text

  • embedding: Question embedding vectors

  • exclusion: Question exclusion principle

  • precision: Precision for each question

  • recall: Recall for each question

  • f1_score: F1 score for each question

  • accuracy: Accuracy for each question

Return type:

DataFrame
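
The per-question metrics listed above follow the standard definitions. The sketch below computes them for a single question, assuming a 'YES' answer is read as a vote for the positive class; that mapping, the positive_label argument, and the function name question_metrics are assumptions rather than documented behaviour.

```python
from typing import Dict, Sequence


def question_metrics(
    answers: Sequence[str], labels: Sequence[str], positive_label: str
) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy of one question's YES/NO answers."""
    if len(answers) != len(labels) or not answers:
        raise ValueError("answers and labels must be non-empty and of equal length")
    pairs = [(a == "YES", y == positive_label) for a, y in zip(answers, labels)]
    tp = sum(pred and true for pred, true in pairs)
    fp = sum(pred and not true for pred, true in pairs)
    fn = sum(not pred and true for pred, true in pairs)
    tn = sum(not pred and not true for pred, true in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy": (tp + tn) / len(pairs),
    }


metrics = question_metrics(
    answers=["YES", "YES", "NO", "NO", "YES"],
    labels=["spam", "ham", "spam", "ham", "spam"],
    positive_label="spam",
)
print(metrics)  # precision, recall and F1 are all 2/3; accuracy is 0.6
```
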

async predict(samples)

Predict labels for samples.

Parameters:

samples (DataFrame) – Samples to predict.

Return type:

AsyncGenerator[Tuple[Any, str, Literal['YES', 'NO'], TokenCounter], None]

Returns:

Async generator yielding (sample_index, question, answer, token_counter) tuples.

Raises:

ValueError – If samples is empty or does not have the correct column.
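
Since predict returns an async generator, results are consumed with async for. The sketch below shows that pattern with a hypothetical stand-in generator, fake_predict, that yields tuples of the documented shape; it does not call the library, and the token counter is replaced by None for illustration.

```python
import asyncio
from collections import defaultdict
from typing import Any, AsyncGenerator, Dict, Tuple

Prediction = Tuple[Any, str, str, None]


async def fake_predict() -> AsyncGenerator[Prediction, None]:
    """Hypothetical stand-in for RRF.predict, yielding the documented tuple shape."""
    rows = [
        (0, "Is the text promotional?", "YES", None),
        (0, "Does it mention a price?", "NO", None),
        (1, "Is the text promotional?", "NO", None),
    ]
    for row in rows:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield row


async def collect_answers(
    predictions: AsyncGenerator[Prediction, None],
) -> Dict[Any, Dict[str, str]]:
    """Group streamed answers by sample index as they arrive."""
    by_sample: Dict[Any, Dict[str, str]] = defaultdict(dict)
    async for sample_index, question, answer, _token_counter in predictions:
        by_sample[sample_index][question] = answer
    return dict(by_sample)


result = asyncio.run(collect_answers(fake_predict()))
print(result)
```

With a fitted model, the same loop would presumably iterate over the generator returned by predict(samples) instead of the stand-in.
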

save(dir_path=None, for_production=False)

Save model config to JSON and dataframes to parquet in a directory.

If dir_path is None, uses <self.save_path>/<self.name>. If for_production is True, strips the questions dataframe and does not save the answers and training dataframes.

Parameters:
  • dir_path (str | PathLike[str] | None) – The directory to save the RRF to.

  • for_production (bool) – Whether to save the RRF for production.

Return type:

None

async set_tasks(instructions_template=None, task_description=None)

Initialize question generation instructions template.

Sets the task description for the RRF, using either a custom template or one generated from the task description by an LLM. For most users, LLM generation is recommended over custom templates.

Parameters:
  • instructions_template (str | None) – Custom template to use. Must contain the '<number_of_questions>' tag. If None, generates the template from task_description using an LLM.

  • task_description (str | None) – Description of classification task to help LLM generate the template.

Return type:

str

Returns:

The question generation instructions template.

Raises:
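
The requirement that a custom template contain the '<number_of_questions>' tag can be checked before calling set_tasks. The following sketch validates a template and fills in the tag by plain string replacement; the substitution mechanism, the ValueError, and the helper name render_instructions are assumptions, since the library's own handling is not described here.

```python
TAG = "<number_of_questions>"


def render_instructions(template: str, n_questions: int) -> str:
    """Validate a custom template and substitute the question count (hypothetical helper)."""
    if TAG not in template:
        # Assumption: the exception the library raises in this case is not documented.
        raise ValueError(f"template must contain the {TAG} tag")
    return template.replace(TAG, str(n_questions))


template = (
    "Generate <number_of_questions> YES/NO questions that help decide "
    "whether an email is spam."
)
print(render_instructions(template, 10))
```
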
async update_question_exclusion(question_id, exclusion)

Update a question exclusion.

Parameters:
  • question_id (str) – The id of the question to update.

  • exclusion (QuestionExclusion | None) – The exclusion to set. If None, removes the exclusion status.

Returns:

The question that was updated.

Return type:

str

Raises:

ValueError – If question id is not found.

property llm_semaphore_limit: int
property question_gen_instructions_template: str | None

Get the question generation instructions template.

property task_description: str | None

Get the task description.

property token_usage: TokenCounter

Get the token counter for the RRF.

Type aliases

EmbeddingModel

alias of str | Literal['hashed_bag_of_words']

AnsSimilarityFunc

alias of str | Literal['jaccard', 'hamming']
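
The two named similarity options can give quite different values for the same pair of answer vectors. The sketch below contrasts them using the textbook definitions, assuming 'YES'/'NO' answer lists and treating the set of 'YES' positions as the set for Jaccard; whether the library defines them exactly this way is an assumption.

```python
from typing import Sequence


def hamming_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of positions where both answers are identical (YES/YES or NO/NO)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Overlap of the YES positions: |A and B| / |A or B|."""
    yes_a = {i for i, x in enumerate(a) if x == "YES"}
    yes_b = {i for i, x in enumerate(b) if x == "YES"}
    union = yes_a | yes_b
    # Assumption: two questions that never answer YES count as identical.
    return len(yes_a & yes_b) / len(union) if union else 1.0


a = ["YES", "NO", "NO", "NO"]
b = ["YES", "YES", "NO", "NO"]
print(hamming_similarity(a, b))  # 0.75: shared NO answers count as agreement
print(jaccard_similarity(a, b))  # 0.5: only YES answers are compared
```

Hamming rewards agreement on both answers, so it tends to be higher when most answers are 'NO'; Jaccard ignores shared 'NO' answers and is the stricter choice in that case.
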