Self-consistency is one of the more advanced prompt engineering techniques. Proposed by Wang et al. (2022), self-consistency aims “to replace the naive greedy decoding used in chain-of-thought prompting.” The idea is to sample multiple, diverse reasoning paths through few-shot CoT and use the generations to select the most consistent answer. This helps boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

For complex reasoning tasks with multiple valid solution paths, self-consistency generates diverse reasoning chains by sampling from the language model's decoder. It then identifies the most consistent final answer by marginalizing over those sampled chains. The approach exploits the observation that problems requiring careful analysis typically admit several distinct reasoning paths, most of which converge on the correct solution.
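Formally (loosely following Wang et al., 2022): if we sample $m$ reasoning paths whose final answers are $a_1, \dots, a_m$, marginalizing out the paths reduces to a plurality vote over those answers:

$\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbf{1}(a_i = a)$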

Combining self-consistency with chain-of-thought yields significant accuracy improvements on a range of benchmarks: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC-challenge, compared to the chain-of-thought prompting baseline.

Several answers to form a majority

Self-consistency is an approach that simply asks the model the same prompt multiple times and takes the majority result as the final answer. It is a follow-up to CoT prompting, and it is most powerful when used in conjunction with it.
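As a minimal sketch of that loop (assuming an OpenAI-style chat-completions client and a task whose final answer fits on one line; the sample_and_vote helper below is illustrative, not from the paper):

from collections import Counter

def sample_and_vote(client, model: str, prompt: str, n: int = 5) -> str:
    """Ask the same prompt n times at nonzero temperature, then majority-vote."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # nonzero temperature so the reasoning paths differ
        )
        # Assumes the prompt asks the model to put its final answer on the last line.
        answers.append(response.choices[0].message.content.strip().splitlines()[-1])
    # The most common final answer across the samples wins.
    return Counter(answers).most_common(1)[0][0]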

Let’s take a simple example: email analysis. Say you’re a software company receiving hundreds of emails per day. You want a model to classify each email as IMPORTANT or NOT IMPORTANT, so you can prioritize those that could have a major impact on your business.

Here's an example of an email you might receive. Let's put that into a prompt:

[Figure: the example email embedded in a self-consistency prompt]

By generating many chains of thought and taking the most common answer (IMPORTANT), we arrive at the correct answer more consistently.
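Concretely, reusing the hypothetical sample_and_vote helper sketched earlier (the email text below is a made-up stand-in for the one shown in the figure, and the model name is illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

email_prompt = """Classify the email below as IMPORTANT or NOT IMPORTANT to a software company.
Think step by step, then put your final label on the last line.

Email:
Hi, I believe I have found a security vulnerability in your product that exposes
customer data. I've attached a proof of concept. Please tell me how to report it
responsibly.
"""

label = sample_and_vote(client, "gpt-4o-mini", email_prompt, n=5)
print(label)  # IMPORTANT, if the majority of sampled chains agree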

Self-consistency has been shown to improve performance on arithmetic, commonsense, and symbolic reasoning tasks. Even on tasks where regular CoT proved ineffective, self-consistency was still able to improve performance.

Code for Self-Consistency

Here's a recap of how it works:

[Figure: recap of the self-consistency workflow]

To use self-consistency, you'll need a script, because LLM chat interfaces don't expose this option. The Python code below implements self-consistency:

import logging
from typing import Any, Dict, List, Tuple
from difflib import SequenceMatcher

logger = logging.getLogger(__name__)

class AdvancedSelfConsistency:
    def __init__(self, client, model: str, num_samples: int = 5,
                 similarity_threshold: float = 0.8):
        self.client = client
        self.model = model
        self.num_samples = num_samples
        self.similarity_threshold = similarity_threshold
        self.self_consistency_completion_tokens = 0

    def generate_responses(self, system_prompt: str, user_prompt: str) -> List[str]:
        """Sample num_samples responses at temperature 1 to obtain diverse reasoning paths."""
        responses = []
        for _ in range(self.num_samples):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=1,
                max_tokens=4096,
            )
            self.self_consistency_completion_tokens += response.usage.completion_tokens
            responses.append(response.choices[0].message.content)
        return responses

    def calculate_similarity(self, a: str, b: str) -> float:
        """Return a similarity ratio between two responses (0.0 to 1.0)."""
        return SequenceMatcher(None, a, b).ratio()

    def cluster_similar_responses(self, responses: List[str]) -> List[List[str]]:
        """Greedily group responses that are similar enough to a cluster's first member."""
        clusters: List[List[str]] = []
        for response in responses:
            added_to_cluster = False
            for cluster in clusters:
                if self.calculate_similarity(response, cluster[0]) >= self.similarity_threshold:
                    cluster.append(response)
                    added_to_cluster = True
                    break
            if not added_to_cluster:
                clusters.append([response])
        return clusters

    def aggregate_results(self, responses: List[str]) -> Dict[str, Any]:
        """Cluster the responses and rank clusters by frequency -- the majority vote."""
        clusters = self.cluster_similar_responses(responses)
        cluster_info = [
            {"answer": cluster[0], "frequency": len(cluster), "variants": cluster}
            for cluster in clusters
        ]
        cluster_info.sort(key=lambda x: x["frequency"], reverse=True)
        return {
            "clusters": cluster_info,
            "total_responses": len(responses),
            "num_unique_clusters": len(clusters),
        }

    def evaluate(self, system_prompt: str, user_prompt: str) -> Dict[str, Any]:
        responses = self.generate_responses(system_prompt, user_prompt)
        aggregated_result = self.aggregate_results(responses)
        return {
            "individual_responses": responses,
            "aggregated_result": aggregated_result,
        }

def advanced_self_consistency_approach(system_prompt: str, initial_query: str,
                                       client, model: str) -> Tuple[str, int]:
    self_consistency = AdvancedSelfConsistency(client, model)
    result = self_consistency.evaluate(system_prompt, initial_query)

    logger.info("Advanced Self-Consistency Results:")
    logger.info(f"Total responses: {result['aggregated_result']['total_responses']}")
    logger.info(f"Number of unique clusters: {result['aggregated_result']['num_unique_clusters']}")
    for i, cluster in enumerate(result['aggregated_result']['clusters'], 1):
        logger.debug(f"\nCluster {i}:")
        logger.debug(f"  Representative answer: {cluster['answer']}")
        logger.debug(f"  Frequency: {cluster['frequency']}")
        logger.debug(f"  Variants: {cluster['variants']}")

    if result['aggregated_result']['clusters']:
        # The representative answer of the largest cluster is the most consistent one.
        return (result['aggregated_result']['clusters'][0]['answer'],
                self_consistency.self_consistency_completion_tokens)
    return "No consistent answer found.", self_consistency.self_consistency_completion_tokens
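For example, assuming an OpenAI-compatible client (the model name and question are illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

answer, tokens_used = advanced_self_consistency_approach(
    system_prompt="You are a careful assistant. Reason step by step.",
    initial_query="A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?",
    client=client,
    model="gpt-4o-mini",
)
print(answer, f"({tokens_used} completion tokens)")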

Self-consistency across multiple LLMs

Given a single chain-of-thought prompt, multiple LLMs can be used to check consistency across models, improving the accuracy of the final result. The same prompt is first passed to several different LLMs (GPT-4, PaLM 2, etc.). A quorum is then determined; in this example GPT-4 serves as the quorum evaluator, but the vote can be implemented with other methods.

[Figure: multi-LLM self-consistency with a quorum evaluator]
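A minimal sketch of this multi-LLM variant (it assumes every model, including the judge, is reachable through an OpenAI-compatible chat-completions client; in practice GPT-4 and PaLM 2 sit behind different APIs, so adapt the calls accordingly):

def multi_llm_self_consistency(clients_and_models, judge_client, judge_model: str, prompt: str) -> str:
    """Send the same CoT prompt to several LLMs, then let a judge model pick the quorum answer."""
    answers = []
    for client, model in clients_and_models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        answers.append(response.choices[0].message.content)

    # The judge (GPT-4 in the article's example) acts as the quorum evaluator.
    judge_prompt = (
        "Several assistants answered the same question. "
        "Return only the answer that the majority of them agree on.\n\n"
        + "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    )
    verdict = judge_client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # the vote itself should be deterministic
    )
    return verdict.choices[0].message.content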

The code is available on GitHub.

Self-consistency best practices

The self-consistency method involves three steps. First, prompt the language model with a CoT prompt. Second, replace the “greedy decoding” (1-best) of standard CoT with sampling from the language model’s decoder to generate a diverse set of reasoning paths. Finally, marginalize over the reasoning paths and aggregate by choosing the most consistent answer from the final answer set.

It is worth noting that self-consistency integrates seamlessly with most sampling algorithms, including but not limited to temperature sampling, top-k sampling, and nucleus (top-p) sampling.
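For instance, with an OpenAI-style API, temperature and nucleus (top-p) sampling are exposed directly as request parameters, and the n parameter draws several reasoning paths in one call (top-k is exposed by some other providers' APIs; the model name here is illustrative):

from openai import OpenAI

client = OpenAI()
cot_prompt = "If there are 3 cars and each car has 4 wheels, how many wheels are there? Think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.8,  # temperature sampling: higher values give more diverse chains
    top_p=0.95,       # nucleus sampling: sample only from the top 95% of probability mass
    n=5,              # five independent reasoning paths in a single request
)
answers = [choice.message.content for choice in response.choices]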

However, tuning these sampling hyperparameters may require direct access to the model's API. When that isn't available, an alternative is simply to prompt the model several times so that it produces a diverse set of candidate reasoning paths on its own.

The answer that shows the highest degree of consistency across the different reasoning trajectories is then the most likely to be correct. Self-consistency improves performance on arithmetic, commonsense, and symbolic reasoning tasks. In practice, it can also be combined with other techniques to further boost model performance; combining self-consistency with a discriminator-guided multi-stage reasoning approach, for example, has been found to significantly improve a model's reasoning capabilities.
