Self-consistency is one of the more advanced techniques for prompt engineering. Proposed by Wang et al. (2022), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting." The idea is to sample several diverse reasoning paths through few-shot CoT and use the generations to select the most consistent answer. This improves the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.
For complex reasoning tasks with multiple valid paths, self-consistency generates diverse reasoning chains by sampling from the language model decoder. It then identifies the most consistent final answer by marginalizing these sampled chains. This approach capitalizes on the observation that problems requiring thoughtful analysis often result in greater diversity of reasoning, leading to a solution.
Combining self-consistency with chain-of-thought prompting leads to significant accuracy gains on a range of benchmarks compared to basic chain-of-thought prompting: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC Challenge.
Self-consistency is an approach that simply asks a model the same prompt multiple times and takes the majority answer as the final result. It is a follow-up to CoT prompting and is more powerful when used in conjunction with it.
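The majority-vote step can be sketched in a few lines of Python. Here the sampled responses are mocked as plain strings; in a real setup each one would come from a separate model call on the same prompt:

```python
from collections import Counter

def majority_vote(responses):
    """Return the most frequent answer among sampled responses, with its count."""
    counts = Counter(r.strip().upper() for r in responses)
    answer, frequency = counts.most_common(1)[0]
    return answer, frequency

# Mocked samples standing in for five model completions of the same prompt.
samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))  # ('42', 3)
```

Even with two dissenting samples, the majority answer wins, which is the core intuition behind self-consistency.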
Let's take a simple example of email analysis. Suppose you work at a software company and receive hundreds of emails a day. You want to use a model to categorize emails as important or not, so you can prioritize those that could have a major impact on your business.
Here's an example of an email you might receive. Let's put this in a prompt:
By generating many chains of thought and taking the most common (IMPORTANT) answer, we can arrive at the correct answer more consistently.
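A minimal sketch of this email-classification vote might look like the following. The prompt text, email body, and sampled chains are all illustrative assumptions, not output from a real model:

```python
import re
from collections import Counter

# Hypothetical prompt; the email content and label set are assumptions.
PROMPT = (
    "Classify the email below as IMPORTANT or NOT IMPORTANT for a software "
    "company. Think step by step, then end with the label on its own line.\n\n"
    "Email: 'Hi, I think I found a security vulnerability in your product...'"
)

def extract_label(completion):
    """Pull the final IMPORTANT / NOT IMPORTANT label out of a reasoning chain."""
    matches = re.findall(r"NOT IMPORTANT|IMPORTANT", completion.upper())
    return matches[-1] if matches else None

# Mocked chains standing in for several sampled completions of PROMPT.
chains = [
    "The sender reports a security flaw, which is urgent. IMPORTANT",
    "Security issues can hurt customers and reputation, so... IMPORTANT",
    "This looks like routine feedback. NOT IMPORTANT",
]
labels = [extract_label(c) for c in chains]
final = Counter(labels).most_common(1)[0][0]
print(final)  # IMPORTANT
```

The regex puts `NOT IMPORTANT` before `IMPORTANT` in the alternation so the longer label is matched first, and the last match in each chain is treated as the model's final answer.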
Self-consistency has been shown to improve performance on arithmetic, common sense, and symbolic reasoning tasks. Even when regular CoT was ineffective, self-consistency was still able to improve performance.
To recap how it works: the model is prompted several times with the same question, each run produces its own reasoning chain, and the most common final answer is kept.
To use self-consistency, it is recommended to use a script, because chat interfaces to LLMs do not expose this option. The following Python code implements self-consistency:
```python
import logging
from difflib import SequenceMatcher
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)


class AdvancedSelfConsistency:
    def __init__(self, client, model: str, num_samples: int = 5,
                 similarity_threshold: float = 0.8):
        self.client = client
        self.model = model
        self.num_samples = num_samples
        self.similarity_threshold = similarity_threshold
        self.self_consistency_completion_tokens = 0

    def generate_responses(self, system_prompt: str, user_prompt: str) -> List[str]:
        """Sample several completions of the same prompt at temperature 1."""
        responses = []
        for _ in range(self.num_samples):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=1,
                max_tokens=4096,
            )
            self.self_consistency_completion_tokens += response.usage.completion_tokens
            responses.append(response.choices[0].message.content)
        return responses

    def calculate_similarity(self, a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    def cluster_similar_responses(self, responses: List[str]) -> List[List[str]]:
        """Group responses whose similarity to a cluster's first member
        exceeds the threshold."""
        clusters: List[List[str]] = []
        for response in responses:
            added_to_cluster = False
            for cluster in clusters:
                if self.calculate_similarity(response, cluster[0]) >= self.similarity_threshold:
                    cluster.append(response)
                    added_to_cluster = True
                    break
            if not added_to_cluster:
                clusters.append([response])
        return clusters

    def aggregate_results(self, responses: List[str]) -> Dict[str, Any]:
        clusters = self.cluster_similar_responses(responses)
        cluster_info = [
            {"answer": cluster[0], "frequency": len(cluster), "variants": cluster}
            for cluster in clusters
        ]
        # Most frequent cluster first: its representative is the consensus answer.
        cluster_info.sort(key=lambda x: x["frequency"], reverse=True)
        return {
            "clusters": cluster_info,
            "total_responses": len(responses),
            "num_unique_clusters": len(clusters),
        }

    def evaluate(self, system_prompt: str, user_prompt: str) -> Dict[str, Any]:
        responses = self.generate_responses(system_prompt, user_prompt)
        return {
            "individual_responses": responses,
            "aggregated_result": self.aggregate_results(responses),
        }


def advanced_self_consistency_approach(system_prompt: str, initial_query: str,
                                       client, model: str) -> Tuple[str, int]:
    self_consistency = AdvancedSelfConsistency(client, model)
    result = self_consistency.evaluate(system_prompt, initial_query)
    aggregated = result["aggregated_result"]
    logger.info("Advanced Self-Consistency Results:")
    logger.info(f"Total responses: {aggregated['total_responses']}")
    logger.info(f"Number of unique clusters: {aggregated['num_unique_clusters']}")
    for i, cluster in enumerate(aggregated["clusters"], 1):
        logger.debug(f"\nCluster {i}:")
        logger.debug(f"  Representative answer: {cluster['answer']}")
        logger.debug(f"  Frequency: {cluster['frequency']}")
        logger.debug(f"  Variants: {cluster['variants']}")
    if aggregated["clusters"]:
        return (aggregated["clusters"][0]["answer"],
                self_consistency.self_consistency_completion_tokens)
    return ("No consistent answer found.",
            self_consistency.self_consistency_completion_tokens)
```

Given a single chain of thought, several different LLMs can also be used to check consistency across models, improving the accuracy of the final result. First, the same chain-of-thought prompt is passed to several different LLMs (GPT-4, PaLM 2, etc.). Then a quorum is determined; in this example GPT-4 acts as the quorum evaluator, but this can be implemented with other methodologies.
The code is available on GitHub.
The self-consistency method involves three steps. First, prompt the language model with a chain-of-thought prompt; next, replace the greedy (1-best) decoding used in CoT prompting with sampling from the language model's decoder to generate a diverse set of reasoning paths; finally, marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
It is worth noting that self-consistency can be seamlessly integrated into most sampling algorithms, including but not limited to temperature sampling, top-k sampling, and nucleus sampling.
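To make these sampling strategies concrete, here is a small, self-contained sketch of temperature scaling, top-k filtering, and nucleus (top-p) filtering over a toy next-token distribution. Real decoders apply these inside the model's sampling loop; the logit values below are arbitrary assumptions for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Keep only the k most probable tokens, renormalized."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, 0.1]       # toy next-token logits (assumed values)
probs = softmax(logits)
print(sorted(top_k(probs, 2)))      # [0, 1]
print(sorted(nucleus(probs, 0.7)))  # [0, 1]
```

Sampling repeatedly from such a truncated distribution, rather than always taking the argmax, is what produces the diverse reasoning paths that self-consistency then votes over.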
However, tuning these hyperparameters may require invoking the model API directly. An alternative is therefore to let the model generate several outputs with its default sampling settings, which still yields a diverse set of candidate reasoning paths.
The answer demonstrating the highest degree of consistency between the different reasoning paths is then more likely to represent the correct solution. Self-consistency improves performance in arithmetic, common-sense, and symbolic reasoning tasks. Furthermore, in practice, self-consistency can be combined with other techniques to further enhance model performance. It has been found that combining self-consistency with a multi-step reasoning approach guided by a discriminator significantly improves the model's reasoning capabilities.