GPTree#

Module contents#

A decision tree classifier employing LLMs for dynamic feature generation.

Each node uses language models to generate contextual questions and evaluate answers, enabling adaptive tree construction for classification tasks.

class GPTree(qgen_llmc, critic_llmc, qgen_instr_llmc, qanswer_llmc=None, qgen_temperature=0.0, critic_temperature=0.0, qgen_instr_gen_temperature=0.0, qanswer_temperature=0.0, criterion='gini', max_depth=None, max_node_width=3, min_samples_leaf=1, llm_semaphore_limit=5, min_question_candidates=3, max_question_candidates=10, expert_advice=None, n_samples_as_context=30, class_ratio='balanced', use_critic=False, save_path=None, name=None, random_state=None)#

Bases: object

LLM based decision tree classifier.

Note that GPTree automatically saves the tree after each node is built.
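The constructor's llm_semaphore_limit parameter (default 5) suggests that concurrent LLM calls are capped with a semaphore. The parameter descriptions are not reproduced on this page, so that reading is an assumption. The sketch below shows the general technique with asyncio.Semaphore; bounded_calls and fake_llm_call are hypothetical names and no real LLM is contacted.

```python
import asyncio


async def bounded_calls(n_calls: int, limit: int) -> int:
    """Run n_calls fake LLM calls and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        # Hypothetical stand-in for a network call to an LLM.
        nonlocal in_flight, peak
        async with semaphore:  # At most `limit` coroutines pass this point at once.
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # Simulated latency.
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_calls)))
    return peak


# The default limit of 5 is taken from the class signature above.
peak = asyncio.run(bounded_calls(n_calls=20, limit=5))
print(peak)
```

Under this reading, a higher limit trades more parallel requests against a greater chance of hitting provider rate limits.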

classmethod load(path)#

Load a GPTree from saved state.

Parameters:

path (str | PathLike[str]) – Directory containing “gptree.json” or the JSON file directly.

Return type:

GPTree

Returns:

Reconstructed GPTree instance.
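The path argument accepts either a directory containing "gptree.json" or the JSON file itself. A minimal sketch of how such a path could be normalized is shown here; resolve_tree_json is a hypothetical helper, not part of GPTree, and the library's own resolution logic may differ.

```python
import tempfile
from os import PathLike
from pathlib import Path
from typing import Union


def resolve_tree_json(path: Union[str, "PathLike[str]"]) -> Path:
    """Return the JSON file path whether given the directory or the file."""
    candidate = Path(path)
    if candidate.is_dir():
        # Directory form: the file name "gptree.json" comes from the docs above.
        return candidate / "gptree.json"
    return candidate


# Demonstrate both accepted forms against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "gptree.json"
    json_file.write_text("{}")
    from_dir = resolve_tree_json(tmp)
    from_file = resolve_tree_json(json_file)
    both_exist = from_dir.is_file() and from_file.is_file()

print(from_dir == from_file)
```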

advice(advice)#

Set context/advice for question generation.

Parameters:

advice (str | None) – The advice to set. If None, the advice is cleared.

Return type:

Literal['Advice taken', 'Advice cleared']

Returns:

“Advice taken” if advice is set, “Advice cleared” if advice is cleared.

async fit(X=None, y=None, *, copy_data=True, reset=False)#

Train or resume tree construction as an async generator.

Parameters:
  • X (DataFrame | None) – Training features. Required on first run or with reset=True.

  • y (Optional[Sequence[str]]) – Training labels. Required on first run or with reset=True.

  • copy_data (bool) – Whether to copy input data.

  • reset (bool) – Clear existing state and restart from root.

Yields:

Node – Updated nodes during tree construction.

Raises:

ValueError – If the data requirements are not met or reset is used incorrectly.

Return type:

AsyncGenerator[Node, None]
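Because fit is an async generator, it is consumed with async for rather than awaited directly. The sketch below illustrates that consumption pattern with stand-ins: StubNode and stub_fit are hypothetical and only mimic the documented shape (an AsyncGenerator that yields nodes as they are built).

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class StubNode:
    """Hypothetical stand-in for Node; only id and label are modeled."""
    id: int
    label: str


async def stub_fit() -> AsyncGenerator[StubNode, None]:
    """Hypothetical stand-in for GPTree.fit: yields nodes as they are 'built'."""
    for i, label in enumerate(["root", "yes", "no"]):
        await asyncio.sleep(0)  # Hand control back to the event loop, as real I/O would.
        yield StubNode(id=i, label=label)


async def collect_nodes() -> List[StubNode]:
    built: List[StubNode] = []
    # Each iteration receives one updated node; progress can be reported here.
    async for node in stub_fit():
        built.append(node)
    return built


nodes = asyncio.run(collect_nodes())
print([n.id for n in nodes])
```

With a real tree, the loop body is a natural place to log progress or to decide whether to call stop().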

get_node(node_id)#

Get the node by id.

Return type:

Node | None

Parameters:

node_id (int)

get_questions()#

Get all questions generated in the tree.

Return type:

DataFrame | None

get_root_id()#

Get the root node id.

Return type:

int | None

get_training_data()#

Get the training data.

Return type:

DataFrame | None

async predict(samples)#

Predict labels for samples with concurrent processing.

Parameters:

samples (DataFrame) – DataFrame with a single column matching the training data format.

Yields:

Tuple of (sample_index, question, answer, node_id, token_usage)

Return type:

AsyncGenerator[Tuple[int, str, str, int, TokenCounter], None]
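Since samples are processed concurrently, tuples for different samples may arrive interleaved. The sketch below groups them by sample_index to recover each sample's path through the tree. It assumes one tuple is yielded per question answered, which this page does not state explicitly; group_by_sample and the example steps are hypothetical, and token_usage is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (sample_index, question, answer, node_id); token_usage is omitted in this sketch.
Step = Tuple[int, str, str, int]


def group_by_sample(steps: Iterable[Step]) -> Dict[int, List[Tuple[str, str, int]]]:
    """Collect the (question, answer, node_id) trail for every sample index."""
    paths: Dict[int, List[Tuple[str, str, int]]] = defaultdict(list)
    for sample_index, question, answer, node_id in steps:
        paths[sample_index].append((question, answer, node_id))
    return dict(paths)


# Made-up, interleaved results for two samples.
steps = [
    (0, "Is the text a question?", "yes", 0),
    (1, "Is the text a question?", "no", 0),
    (0, "Does it mention a price?", "no", 1),
]
paths = group_by_sample(steps)
for sample_index, trail in paths.items():
    print(sample_index, trail)
```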

prune_tree(node_id)#

Prune the tree from the node with the given ID.

Parameters:

node_id (int) – The ID of the node to prune.

Raises:
  • ValueError – If the node with the given ID is not found on the tree.

  • ValueError – If the node with the given ID is a leaf node.

Return type:

None
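To illustrate what pruning from a node means, the sketch below works on a toy adjacency map rather than a real GPTree. It assumes pruning removes every descendant of the node and leaves the node itself in place as a leaf, which matches the "clear the subtree" wording used for resume_fit; the prune function and the data layout are hypothetical.

```python
from typing import Dict, List


def prune(children: Dict[int, List[int]], node_id: int) -> List[int]:
    """Remove all descendants of node_id in place and return their ids, sorted."""
    if node_id not in children:
        raise ValueError(f"Node {node_id} not found")
    if not children[node_id]:
        raise ValueError(f"Node {node_id} is a leaf")
    removed: List[int] = []
    stack = list(children[node_id])
    while stack:
        current = stack.pop()
        removed.append(current)
        # Drop the descendant and queue its own children for removal.
        stack.extend(children.pop(current))
    children[node_id] = []  # The pruned node stays in the tree as a leaf.
    return sorted(removed)


# Toy tree: node id -> child ids.
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
removed = prune(tree, 1)
print(removed, tree)
```

The two ValueError branches mirror the documented error cases: an unknown node ID and a node that is already a leaf.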

async resume_fit(node_id)#

Enqueue a node to resume (re)building its subtree from current data.

Typical usage:
  • Call prune_tree(node_id) to clear the subtree

  • Iterate over resume_fit(node_id) with async for to continue building from that node

Parameters:

node_id (int) – The ID of the node to resume building from.

Yields:

Node – Updated nodes during tree construction.

Raises:
  • ValueError – If the node with the given ID is not found on the tree.

  • ValueError – If the tree has no training data loaded.

Return type:

AsyncGenerator[Node, None]

save(dir_path=None, for_production=False)#

Save model config to JSON and dataframes to parquet in a directory.

If dir_path is None, uses <self.save_path>/<self.name>. If for_production is True, does not save the training dataframe.

Parameters:
  • dir_path (str | PathLike[str] | None) – The directory to save the tree to.

  • for_production (bool) – If True, save without the training dataframe.

Return type:

None

async set_tasks(instructions_template=None, task_description=None)#

Initialize question generation instructions template.

Sets the task description for the tree, then either stores the supplied custom template or generates one from the task description using an LLM. For most users, LLM generation is recommended over a custom template.

Parameters:
  • instructions_template (str | None) – Custom template to use. Must contain the ‘<number_of_questions>’ tag. If None, a template is generated from task_description using an LLM.

  • task_description (str | None) – Description of classification task to help LLM generate the template.

Return type:

str

Returns:

The question generation instructions template.
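A custom template must contain the <number_of_questions> tag. The sketch below checks for the tag and fills it in with plain string replacement. How GPTree substitutes the tag internally is not documented here, so str.replace is an assumption, and is_valid_template and render_template are hypothetical helpers.

```python
TAG = "<number_of_questions>"  # Tag name taken from the parameter description above.


def is_valid_template(template: str) -> bool:
    """A custom template is usable only if it contains the required tag."""
    return TAG in template


def render_template(template: str, n_questions: int) -> str:
    """Fill the tag with a concrete count (assumed substitution strategy)."""
    if not is_valid_template(template):
        raise ValueError(f"Template must contain the {TAG} tag")
    return template.replace(TAG, str(n_questions))


# Hypothetical template text.
template = "Generate <number_of_questions> yes/no questions that separate the classes."
rendered = render_template(template, 5)
print(rendered)
```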

stop()#

Stop the training process.

Return type:

None

view_node(node_id, format='png', add_all_questions=False, truncate_length=140)#

Render subtree rooted at node_id as PNG/SVG bytes.

Parameters:
  • node_id (int) – Root node ID for the subtree visualization.

  • format (Literal['png', 'svg']) – Output image format (‘png’ or ‘svg’).

  • add_all_questions (bool) – Include all generated questions in node display.

  • truncate_length (int | None) – Maximum text length before truncation. None disables truncation.

Return type:

bytes

Returns:

Rendered subtree image data as bytes.

Raises:
  • ValueError – If node_id doesn’t exist in the tree.

  • ImportError – If graphviz package is not installed.

property classes: List[str] | None#

Class labels the tree is trying to predict.

property critic_instructions_template: str | None#

Get the critic instructions template.

property expert_advice: str | None#

Expert advice set on the tree.

property question_gen_instructions_template: str | None#

Get the question generation instructions template.

property task_description: str | None#

Description of the classification task.

property token_usage: TokenCounter#

Get the token counter for the GPTree.

class Node(id, label, question=None, questions=<factory>, cumulative_memory=None, split_ratios=None, gini=0.0, class_distribution=<factory>, children=<factory>, parent_id=None)#

Bases: object

A Node represents a decision point in GPTree.

classmethod from_dict(d)#

Convert a dictionary to a node.

Return type:

Node

Parameters:

d (Dict[str, Any])

children: List[Node]#
class_distribution: Dict[str, int]#
cumulative_memory: str | None#
gini: float#
id: int#
property is_leaf: bool#

Check if the node is a leaf node.

label: str#
parent_id: int | None#
question: NodeQuestion | None#
questions: List[NodeQuestion]#
split_ratios: Optional[Tuple[int, ...]]#

class NodeQuestion(value, choices, question_type, df_column=<factory>, score=None)#

Bases: object

A question generated at a node.

classmethod from_dict(d)#

Convert a dictionary to a node question.

Return type:

NodeQuestion

Parameters:

d (Dict[str, Any])

to_dict()#

Convert the node question to a dictionary.

Return type:

Dict[str, Any]

choices: List[str]#
df_column: str#
question_type: Literal['INFERENCE', 'CODE']#
score: float | None#
value: str#
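The from_dict and to_dict pair implies that a question can be round-tripped through a plain dictionary. The sketch below shows that pattern with a stand-in dataclass that mirrors the documented fields. QuestionSketch is hypothetical: the real NodeQuestion uses a factory default for df_column, and the exact dictionary layout it produces is not documented on this page.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Literal, Optional


@dataclass
class QuestionSketch:
    """Hypothetical stand-in mirroring the documented NodeQuestion fields."""
    value: str
    choices: List[str]
    question_type: Literal["INFERENCE", "CODE"]
    df_column: str = ""  # Placeholder default; the real class uses a factory.
    score: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "QuestionSketch":
        return cls(**d)


question = QuestionSketch(
    value="Does the text mention a refund?",
    choices=["yes", "no"],
    question_type="INFERENCE",
)
restored = QuestionSketch.from_dict(question.to_dict())
print(restored == question)
```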

Type aliases#

QuestionType#

alias of Literal['INFERENCE', 'CODE']

Criterion#

alias of Literal['gini']
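The only documented criterion is 'gini'. As a reference point, the sketch below computes the standard Gini impurity, 1 - sum(p_k^2), from a class-count mapping shaped like Node.class_distribution. It is the textbook formula, not necessarily the exact computation GPTree performs, and gini_impurity is a hypothetical helper.

```python
from typing import Dict


def gini_impurity(class_distribution: Dict[str, int]) -> float:
    """Gini impurity for a mapping of class label -> sample count."""
    total = sum(class_distribution.values())
    if total == 0:
        return 0.0  # Convention chosen here: an empty node counts as pure.
    return 1.0 - sum((count / total) ** 2 for count in class_distribution.values())


# Made-up class counts.
print(gini_impurity({"spam": 5, "ham": 5}))   # 0.5, evenly mixed two-class node
print(gini_impurity({"spam": 10, "ham": 0}))  # 0.0, pure node
```

Lower values indicate purer nodes, so under this criterion a good question is one whose answers split the samples into child nodes with low impurity.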