For the query

How do you prove stability of the discontinuous Galerkin method for hyperbolic conservation laws? For each approach let me know which papers are relevant and what the main idea is. If there are multiple approaches, let me know which one is the most common and why.

it will, for example, search my research papers in the Bookends attachments folder and return the answer below. The code follows after the answer; observe that the vector store only needs to be created once, and the papers only need to be uploaded once. The code is an adaptation of https://cookbook.openai.com/examples/fi ... _responses.

Files used: {'Lepsky2000.pdf', 'Li2010.pdf', 'Houston2002a.pdf', 'Cockburn2001.pdf', 'Zhu2011.pdf'}
Response:
Proving the stability of the discontinuous Galerkin (DG) method for hyperbolic conservation laws can be approached through several methods. Here are the primary approaches and their relevant literature:
### 1. **Energy Methods**
- **Main Idea:** This approach uses energy estimates to derive stability results. By defining an appropriate energy norm, one can establish bounds that show the solutions remain stable over time.
- **Relevant Papers:**
- **Cockburn and Shu (1990)**: "TVB Runge–Kutta local projection discontinuous Galerkin finite elements for hyperbolic conservation laws." This paper discusses stability in the framework of energy methods and presents a framework for the generalized local projection DG method.
- **Houston et al. (2000)**: Their work on stabilized finite element methods for first-order hyperbolic problems also uses energy estimates and shows optimal convergence.
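(Aside, not part of the model's answer: a minimal sketch of the energy argument, for the model problem of linear advection, $u_t + u_x = 0$, with periodic boundary conditions. Testing the semi-discrete DG scheme against the numerical solution $u_h$ itself and using the upwind flux yields

$$\frac{d}{dt}\,\frac{1}{2}\|u_h\|_{L^2}^2 \;=\; -\frac{1}{2}\sum_j [\![u_h]\!]_{j+1/2}^2 \;\le\; 0,$$

where $[\![u_h]\!]_{j+1/2}$ denotes the jump of $u_h$ across the cell interface $x_{j+1/2}$. The $L^2$ norm of the discrete solution cannot grow in time, which is exactly the stability statement.)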
### 2. **Streamline-Diffusion Methods**
- **Main Idea:** This approach introduces an additional numerical diffusion term specific to the characteristics of the problem. This helps to control oscillations and enhance stability.
- **Relevant Papers:**
- **Bey and Oden (1996)**: Their work on hp-version discontinuous Galerkin methods addresses the introduction of streamline diffusion to stabilize the numerical scheme.
- **Houston et al. (2002)**: They explore a family of hp-version DG methods with least-squares stabilization, demonstrating optimal error bounds and convergence rates.
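(Aside, not part of the model's answer: a rough sketch of how the stabilization term enters, for the steady model problem $\mathbf{b}\cdot\nabla u = f$. The streamline-diffusion formulation replaces the test function $v_h$ by $v_h + \delta\,\mathbf{b}\cdot\nabla v_h$, i.e. find $u_h \in V_h$ such that

$$(\mathbf{b}\cdot\nabla u_h,\; v_h + \delta\,\mathbf{b}\cdot\nabla v_h) = (f,\; v_h + \delta\,\mathbf{b}\cdot\nabla v_h) \quad \text{for all } v_h \in V_h,$$

with an element-wise parameter $\delta > 0$, typically of order $h$. Taking $v_h = u_h$ shows that the extra term contributes $\delta\,\|\mathbf{b}\cdot\nabla u_h\|^2$ to the left-hand side: added diffusion along the streamlines, which is what damps the oscillations.)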
### 3. **A Posteriori Error Analysis**
- **Main Idea:** This method entails using error estimators to adaptively refine the mesh and time step, ensuring stability through controlled approximation.
- **Relevant Papers:**
- **Houston et al. (1999)**: This work analyses a posteriori error estimates for numerical approximations, focusing on stabilizing the solution near discontinuities.
- **Li and Shu (2010)**: Discusses a central DG method for Hamilton-Jacobi equations that ensures stability and optimal error estimates with a posteriori strategies.
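(Aside, not part of the model's answer, and only illustrative since the estimators in the cited papers differ in detail: a typical residual-based indicator for a scalar conservation law $u_t + f(u)_x = 0$ measures, element by element, how badly the discrete solution fails to satisfy the PDE, e.g. something of the form

$$\eta_K = h_K\,\big\|(u_h)_t + f(u_h)_x\big\|_{L^2(K)} + \text{interface jump terms},$$

and the mesh is refined where $\eta_K$ is large.)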
### 4. **Local Lax-Friedrichs Fluxes**
- **Main Idea:** By employing proper fluxes at the interfaces of elements, one guarantees that the stability conditions are satisfied for the entire domain.
- **Relevant Papers:**
- **Cockburn and Shu (1998)** discuss local discontinuous Galerkin methods for Euler equations, explaining the use of Lax-Friedrichs fluxes to maintain stability.
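(Aside, not part of the model's answer: the local Lax-Friedrichs flux referred to here has the standard form

$$\hat{f}(u^-, u^+) = \frac{1}{2}\big(f(u^-) + f(u^+)\big) - \frac{\alpha}{2}\,(u^+ - u^-), \qquad \alpha = \max |f'(u)|,$$

where $u^-$ and $u^+$ are the traces of the solution from the two neighbouring elements and the maximum is taken locally over the relevant range of $u$. The term $-\frac{\alpha}{2}(u^+ - u^-)$ is the dissipation that stabilizes the interface coupling.)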
### Most Common Approach
The **energy methods** combined with **streamline-diffusion techniques** are among the most common approaches due to their robustness and ease of implementation across various applications. These methods effectively manage the stability and convergence properties of DG methods, particularly when dealing with complex hyperbolic systems.
### Conclusion
The choice of method often depends on the specific application and the nature of the hyperbolic equations involved. As of now, energy methods with tailored numerical diffusion mechanisms are prominently accepted for their comprehensive applicability and rigor in ensuring stability.
Code:
import concurrent.futures
import os

import keyring
from openai import OpenAI
from tqdm import tqdm
###########################################
# Script Purpose:
# This script uploads PDF research papers from Bookends' attachments folder to a vector store,
# enabling question answering using LLM + RAG (Retrieval-Augmented Generation).
###########################################
###########################################
# Boolean switches and configurations
###########################################
# These switches control the flow of the script
b_create_vector_store = True # Set to True to create a new vector store
b_upload_pdfs = True # Set to True to upload PDFs to the vector store
b_search_vector_store = False # Set to True to search the vector store
b_integrate_with_llm = True # Set to True to integrate with the LLM for responses
keyring_username = "your-username" # Replace with your keyring username
store_name = "research-papers"
dir_pdfs = '/path/to/Bookends/Attachments/' # Replace with the path to your Bookends attachments folder
query = "How do you prove stability of the discontinuous Galerkin method for hyperbolic conservation laws? For each approach let me know which papers are relevant and what the main idea is. If there are multiple approaches, let me know which one is the most common and why."
###########################################
# OpenAI API Key Management
###########################################
def get_openai_api_key(keyring_username: str) -> str:
"""
Retrieve the OpenAI API key from environment or keyring.
This function assumes that the API key is saved in the Apple keychain using the `keyring` library.
Raises a ValueError if no key is found.
To generate and save the OpenAI API key in the keychain:
1. Log in to your OpenAI account and navigate to the API Keys section.
2. Generate a new API key and copy it.
3. Use the `keyring` library to save the key in the keychain:
Example:
```python
keyring.set_password("openai", "your-username", "your-api-key")
```
Replace "your-username" with your keyring username and "your-api-key" with the generated API key.
"""
api_key = os.getenv("OPENAI_API_KEY") or keyring.get_password("openai", keyring_username)
if not api_key:
raise ValueError("No OpenAI API token found in environment or keyring.")
return api_key
###########################################
# PDF upload
###########################################
def upload_single_pdf(file_path: str, vector_store_id: str):
"""
Upload a single PDF file to the vector store.
Returns a status indicating success or failure.
"""
file_name = os.path.basename(file_path)
try:
        # Use a context manager so the file handle is closed after the upload
        with open(file_path, 'rb') as f:
            file_response = client.files.create(file=f, purpose="assistants")
        client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
return {"file": file_name, "status": "success"}
except Exception as e:
print(f"Error with {file_name}: {str(e)}")
return {"file": file_name, "status": "failed", "error": str(e)}
def upload_pdf_files_to_vector_store(vector_store_id: str, pdf_files: list):
"""
Upload multiple PDF files to the vector store in parallel and track success/failure.
Returns a dictionary with upload statistics.
"""
stats = {"total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}
print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
for future in tqdm(concurrent.futures.as_completed(futures), total=len(pdf_files)):
result = future.result()
if result["status"] == "success":
stats["successful_uploads"] += 1
else:
stats["failed_uploads"] += 1
stats["errors"].append(result)
return stats
###########################################
# Create and load vector store
###########################################
def create_vector_store(store_name: str) -> dict:
"""
Create a new vector store with the given name.
Returns the details of the created vector store.
"""
try:
vector_store = client.vector_stores.create(name=store_name)
details = {
"id": vector_store.id,
"name": vector_store.name,
"created_at": vector_store.created_at,
"file_count": vector_store.file_counts.completed
}
print("Vector store created:", details)
return details
except Exception as e:
print(f"Error creating vector store: {e}")
return {}
def load_vector_store(store_name: str) -> dict:
"""
Load an existing vector store by name.
Returns the details of the loaded vector store.
"""
try:
vector_stores = client.vector_stores.list()
for store in vector_stores.data:
if store.name == store_name:
details = {
"id": store.id,
"name": store.name,
"created_at": store.created_at,
"file_count": store.file_counts.completed
}
print("Vector store loaded:", details)
return details
print(f"Vector store '{store_name}' not found.")
return {}
except Exception as e:
print(f"Error loading vector store: {e}")
return {}
###########################################
# Main function to run the script
###########################################
if __name__ == "__main__":
# Initialize OpenAI client
    client = OpenAI(api_key=get_openai_api_key(keyring_username))
# Create or load vector store
if b_create_vector_store:
vector_store_details = create_vector_store(store_name)
else:
vector_store_details = load_vector_store(store_name)
# Upload PDF files to vector store
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs) if f.lower().endswith('.pdf')]
print(f"Found {len(pdf_files)} PDF files in {dir_pdfs}.")
if b_upload_pdfs:
upload_pdf_files_to_vector_store(vector_store_details["id"], pdf_files)
# Search in vector store
if b_search_vector_store:
search_results = client.vector_stores.search(
vector_store_id=vector_store_details['id'],
query=query
)
# Print search results
print(f"Found {len(search_results.data)} results for query '{query}':")
for result in search_results.data:
            print(f"{len(result.content[0].text)} characters of content from {result.filename} with a relevance score of {result.score}")
# Integrate with LLM
if b_integrate_with_llm:
response = client.responses.create(
            input=query,
model="gpt-4o-mini",
tools=[{
"type": "file_search",
"vector_store_ids": [vector_store_details['id']],
}],
include=["file_search_call.results"]
)
        # output[0] is the file_search tool call; output[1] is the assistant message
        annotations = response.output[1].content[0].annotations
        # Collect the filenames of the retrieved source documents
        retrieved_files = {annotation.filename for annotation in annotations}
        print(f'Files used: {retrieved_files}')
        print('Response:')
        print(response.output[1].content[0].text)
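As noted above, creating the vector store and uploading the papers are one-time steps. For later runs against the same store, the switches at the top of the script would be set along these lines (a sketch; the store is then looked up by name via load_vector_store):

b_create_vector_store = False # Reuse the existing store via load_vector_store
b_upload_pdfs = False # Papers are already in the vector store
b_search_vector_store = False # Optional: raw vector store search
b_integrate_with_llm = True # Ask new questions against the stored papers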