Random Rule Forest (RRF)
Module contents
Random Rule Forest.
An interpretable ensemble framework for binary classification based on YES/NO questions generated by LLMs.
- class QuestionExclusion(*values)
Bases: StrEnum
- EXPERT = 'expert'
- PREDICTION_SIMILARITY = 'prediction_similarity'
- SEMANTICS = 'semantics'
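The three members are plain string values, so an exclusion reason read back from storage can be turned into a member by value. The sketch below is a stand-in, not the library's class: it assumes a `str` + `Enum` mix-in so that it also runs on Python versions without `StrEnum`, and the variable names are illustrative.

```python
from enum import Enum


class QuestionExclusion(str, Enum):
    """Illustrative stand-in mirroring the three documented members."""

    EXPERT = "expert"
    PREDICTION_SIMILARITY = "prediction_similarity"
    SEMANTICS = "semantics"


# Look up a member from its stored string value.
reason = QuestionExclusion("semantics")

# Members are also strings, so they compare equal to plain strings.
is_semantic = reason == "semantics"
```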
- class RRF(qgen_llmc, qanswer_llmc=None, qgen_temperature=0.0, qanswer_temperature=0.0, llm_semaphore_limit=3, answer_similarity_func='hamming', max_generated_questions=100, max_samples_as_context=30, class_ratio=(1.0, 1.0), q_answer_update_interval=10, save_path=None, name=None, random_state=42)
Bases: object
Interpretable ensemble binary classifier.
- Parameters:
qgen_llmc (List[Union[AnthropicChoice, GoogleChoice, OpenAIChoice, XAIChoice, AnthropicChoiceDict, GoogleChoiceDict, OpenAIChoiceDict, XAIChoiceDict]]) – LLMs to use for question generation, in priority order.
qanswer_llmc (Optional[List[Union[AnthropicChoice, GoogleChoice, OpenAIChoice, XAIChoice, AnthropicChoiceDict, GoogleChoiceDict, OpenAIChoiceDict, XAIChoiceDict]]]) – LLMs to use for answering questions, in priority order. If None, use qgen_llmc.
qgen_temperature (float) – Sampling temperature for question generation.
qanswer_temperature (float) – Sampling temperature for answering questions.
llm_semaphore_limit (int) – Max concurrent LLM calls.
answer_similarity_func (Union[str, Literal['jaccard', 'hamming']]) – Function to use for answer similarity.
max_generated_questions (int) – Maximum number of questions to generate. Max 1000.
max_samples_as_context (int) – Number of samples used as context in a round of question generation. Max 100.
class_ratio (Tuple[float, float]) – Ratio of YES to NO samples to use as context in a round of question generation.
q_answer_update_interval (int) – Logging interval of question answering.
save_path (str | PathLike[str] | None) – Directory to save checkpoints/models.
random_state (int) – Random seed.
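The two documented values of answer_similarity_func, 'hamming' and 'jaccard', name standard set and vector similarity measures. The sketch below shows one common way to define them over YES/NO answer vectors. It assumes answers are stored as the strings "YES" and "NO" and that Jaccard is computed over the YES answers; the function names are hypothetical and the library's own definitions may differ.

```python
from typing import Sequence


def _check(a: Sequence[str], b: Sequence[str]) -> None:
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("answer vectors must be non-empty and equally long")


def hamming_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of samples on which two questions give the same answer."""
    _check(a, b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Overlap of YES answers: |both YES| / |at least one YES|."""
    _check(a, b)
    both = sum(x == "YES" and y == "YES" for x, y in zip(a, b))
    either = sum(x == "YES" or y == "YES" for x, y in zip(a, b))
    # Convention chosen for this sketch: two all-NO vectors count as identical.
    return both / either if either else 1.0


q1 = ["YES", "YES", "NO", "NO"]
q2 = ["YES", "NO", "NO", "NO"]
hamming = hamming_similarity(q1, q2)  # answers agree on 3 of 4 samples
jaccard = jaccard_similarity(q1, q2)  # 1 shared YES out of 2 samples with any YES
```

Hamming similarity rewards agreement on NO answers as well as YES answers, while this Jaccard variant ignores samples where both questions answer NO.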
- classmethod load(dir_path)
Load an RRF saved by save.
- async add_question(question)
Add a question to the RRF.
- Parameters:
question (str) – The question to add.
- Raises:
ValueError – If question already exists.
- Return type:
Literal[True]
- filter_questions_on_pred_similarity(threshold)
Filter questions on prediction similarity.
If two questions have a prediction similarity greater than or equal to the threshold, the question with the lower F1 score is excluded.
- Parameters:
threshold – Similarity threshold; must satisfy 0 < threshold <= 1.
- Raises:
AssertionError – If threshold does not satisfy 0 < threshold <= 1.
- Return type:
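The exclusion rule described for prediction similarity can be sketched as a greedy pass over the questions. This is a minimal illustration under stated assumptions, not the library's implementation: the function name, the dictionary inputs, and the use of Hamming similarity (the documented default of answer_similarity_func) are all choices made for the example.

```python
from typing import Dict, List, Set


def filter_by_prediction_similarity(
    answers: Dict[str, List[str]],
    f1_scores: Dict[str, float],
    threshold: float,
) -> Set[str]:
    """Return the ids of questions to exclude (hypothetical helper).

    Assumes non-empty answer vectors of equal length for every question.
    """
    assert 0 < threshold <= 1, "threshold must satisfy 0 < threshold <= 1"
    excluded: Set[str] = set()
    kept: List[str] = []
    # Visit questions from best to worst F1 so the stronger one of a pair survives.
    for qid in sorted(answers, key=lambda q: f1_scores[q], reverse=True):
        for other in kept:
            agree = sum(x == y for x, y in zip(answers[qid], answers[other]))
            if agree / len(answers[qid]) >= threshold:
                excluded.add(qid)
                break
        else:
            kept.append(qid)
    return excluded


answers = {
    "q1": ["YES", "YES", "NO", "NO"],
    "q2": ["YES", "YES", "NO", "YES"],
    "q3": ["NO", "NO", "YES", "YES"],
}
f1_scores = {"q1": 0.8, "q2": 0.6, "q3": 0.7}
to_exclude = filter_by_prediction_similarity(answers, f1_scores, threshold=0.75)
```

Here q1 and q2 agree on three of four samples, which meets the 0.75 threshold, so the lower-F1 question q2 is the one excluded.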
- async filter_questions_on_semantics(threshold, emb_model)
Filter questions on semantics.
If two questions have a semantic similarity greater than or equal to the threshold, the question with the lower F1 score is excluded.
- Parameters:
threshold – Similarity threshold; must satisfy 0 < threshold <= 1.
emb_model – Embedding model used to compute the question embeddings.
- Raises:
AssertionError – If threshold does not satisfy 0 < threshold <= 1.
- Return type:
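The semantic variant of the rule compares question embeddings instead of answers. The sketch below assumes cosine similarity as the measure, which this page does not state, and it uses made-up two-dimensional vectors in place of real embeddings. Both helper names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_u * norm_v)


def question_to_exclude(
    pair: Tuple[str, str],
    embeddings: Dict[str, Sequence[float]],
    f1_scores: Dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Return the lower-F1 question of a pair if the pair is too similar."""
    a, b = pair
    if cosine_similarity(embeddings[a], embeddings[b]) < threshold:
        return None
    # On an F1 tie this sketch excludes the second question of the pair.
    return a if f1_scores[a] < f1_scores[b] else b


# Toy vectors standing in for real question embeddings.
embeddings = {"q1": [3.0, 4.0], "q2": [6.0, 8.0], "q3": [4.0, -3.0]}
f1_scores = {"q1": 0.7, "q2": 0.5, "q3": 0.9}

near_duplicate = question_to_exclude(("q1", "q2"), embeddings, f1_scores, 0.9)
unrelated = question_to_exclude(("q1", "q3"), embeddings, f1_scores, 0.9)
```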
- async fit(X=None, y=None, *, copy_data=True, reset=False)
Fit the RRF to the data.
- Parameters:
- Returns:
Updated RRF.
- Return type:
Self
- Raises:
ValueError – If data requirements are not met or reset is used invalidly.
- get_answers()
Get answers dataframe.
- Returns:
DataFrame containing:
Columns: the question ids (not excluded during semantic filtering).
Index: the sample indices from self._X.
- Return type:
DataFrame
- get_questions()
Get generated questions dataframe.
- Returns:
DataFrame containing:
question: Generated question text
embedding: Question embedding vectors
exclusion: Question exclusion principle
precision: Precision for each question
recall: Recall for each question
f1_score: F1 score for each question
accuracy: Accuracy for each question
- Return type:
DataFrame
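The per-question metric columns follow the usual binary classification definitions. As a sketch of how one row could be computed, the helper below scores a single question's answers against the true labels. It assumes that both are encoded as "YES"/"NO" strings and that YES is the positive class; the function name and the zero-division convention are choices made for this example.

```python
from typing import Dict, Sequence


def question_metrics(answers: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy of one question, with YES as positive."""
    if len(answers) != len(labels) or len(answers) == 0:
        raise ValueError("answers and labels must be non-empty and equally long")
    pairs = list(zip(answers, labels))
    tp = sum(a == "YES" and y == "YES" for a, y in pairs)
    fp = sum(a == "YES" and y == "NO" for a, y in pairs)
    fn = sum(a == "NO" and y == "YES" for a, y in pairs)
    tn = sum(a == "NO" and y == "NO" for a, y in pairs)
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1_score": f1, "accuracy": accuracy}


answers = ["YES", "YES", "NO", "NO", "YES"]
labels = ["YES", "NO", "NO", "YES", "YES"]
metrics = question_metrics(answers, labels)
```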
- async predict(samples)
Predict labels for samples.
- Parameters:
samples (DataFrame) – Samples to predict.
- Return type:
AsyncGenerator[Tuple[Any, str, Literal['YES', 'NO'], TokenCounter], None]
- Returns:
Generator of predictions (sample_index, question, answer, token_counter).
- Raises:
ValueError – If samples is empty or does not have the correct column.
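Because predict is an async generator that yields one tuple per sample and question, the caller iterates it with `async for` and decides how to combine the per-question answers. The sketch below uses a hypothetical stand-in generator (`fake_predict`, with `None` in place of a TokenCounter) and aggregates by simple majority vote. The majority vote is an assumption made for illustration; this page does not say how answers are meant to be combined.

```python
import asyncio
from collections import Counter, defaultdict
from typing import Any, AsyncGenerator, Dict, Tuple

Prediction = Tuple[Any, str, str, None]


async def fake_predict() -> AsyncGenerator[Prediction, None]:
    """Hypothetical stand-in yielding tuples shaped like the documented ones."""
    rows = [
        (0, "Q1", "YES", None),
        (0, "Q2", "YES", None),
        (0, "Q3", "NO", None),
        (1, "Q1", "NO", None),
        (1, "Q2", "NO", None),
        (1, "Q3", "YES", None),
    ]
    for row in rows:
        yield row


async def majority_vote(predictions: AsyncGenerator[Prediction, None]) -> Dict[Any, str]:
    """Collect answers per sample and keep the most frequent one."""
    votes: Dict[Any, Counter] = defaultdict(Counter)
    async for sample_index, _question, answer, _token_counter in predictions:
        votes[sample_index][answer] += 1
    # Ties resolve to NO in this sketch.
    return {idx: ("YES" if c["YES"] > c["NO"] else "NO") for idx, c in votes.items()}


labels = asyncio.run(majority_vote(fake_predict()))
```

With a real model, the stand-in generator would be replaced by the generator returned from predict, and the fourth tuple element would carry the token counter.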
- save(dir_path=None, for_production=False)
Save model config to JSON and dataframes to parquet in a directory.
If dir_path is None, uses <self.save_path>/<self.name>. If for_production is True, strips the questions dataframe and does not save the answers and training dataframes.
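The directory fallback described above can be sketched with pathlib. The helper name is hypothetical, and raising ValueError when neither an explicit directory nor both save_path and name are available is an assumption of this sketch, not documented behavior.

```python
from os import PathLike
from pathlib import Path
from typing import Optional, Union

StrPath = Union[str, "PathLike[str]"]


def resolve_save_dir(
    dir_path: Optional[StrPath],
    save_path: Optional[StrPath],
    name: Optional[str],
) -> Path:
    """Pick the target directory: explicit dir_path, else <save_path>/<name>."""
    if dir_path is not None:
        return Path(dir_path)
    if save_path is None or name is None:
        # Assumed behavior for this sketch only.
        raise ValueError("either dir_path or both save_path and name are required")
    return Path(save_path) / name


default_dir = resolve_save_dir(None, "models", "demo")
explicit_dir = resolve_save_dir("out", "models", "demo")
```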
- async set_tasks(instructions_template=None, task_description=None)
Initialize the question generation instructions template.
This sets the task description for the RRF, either by using a custom template or by generating one from a task description with an LLM. For most users, LLM generation is recommended over custom templates.
- Parameters:
- Return type:
- Returns:
The question generation instructions template.
- Raises:
ValueError – If the template is missing a required tag or generation fails.
AssertionError – If both parameters are None.
- async update_question_exclusion(question_id, exclusion)
Update a question exclusion.
- Parameters:
question_id (str) – The id of the question to update.
exclusion (QuestionExclusion | None) – The exclusion to set. If None, removes the exclusion status.
- Returns:
The question that was updated.
- Return type:
- Raises:
ValueError – If question id is not found.
- property question_gen_instructions_template: str | None
Get the question generation instructions template.
- property token_usage: TokenCounter
Get the token counter for the RRF.