Document and Query Representation with TF and TF-IDF

Description

This project represents documents and queries with the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) models. Its goal is to build matrices that record how often each vocabulary term occurs in each document or query (TF) and how much weight that term carries once its rarity across the collection is taken into account (TF-IDF). The project builds on the results of previous laboratories and uses these representations as a basis for information retrieval over text collections.

Files Included

lab5-docs.py: A Python script for processing document texts to create TF and TF-IDF matrices.
lab5-query.py: A Python script for processing query texts to create TF and TF-IDF matrices.
Laboratorio 5 Representación tf y tf-idf.pdf: Official documentation detailing the objectives, methodology, and results of the project.
cueri-TRUNCADO.txt: A truncated version of query texts.
documento-TRUNCADO.txt: A truncated version of document texts.
IDF-docs.txt: A text file containing the IDF values for documents.
IDF-query.txt: A text file containing the IDF values for queries.
matriz-docs.txt: A text file containing the TF matrix for documents.
matriz-query.txt: A text file containing the TF matrix for queries.
matriz-TFIDF-docs.txt: A text file containing the TF-IDF matrix for documents.
matriz-TFIDF-query.txt: A text file containing the TF-IDF matrix for queries.
vocabularioReducidoT.txt: A text file containing the reduced vocabulary.

Notable Code Snippets

1. Vectorizing Documents (lab5-docs.py)

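This snippet reads the document texts, one document per line, and builds a TF matrix over the reduced vocabulary: one row per document and one column per vocabulary term. The helpers leer_archivo, generar_vocabulario, and escribir_matriz are not shown here.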

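def vectorizar_documentos(archivo_documentos, vocabulario):
    """Build the TF matrix: one row per line of the file, one column per vocabulary term."""
    lineas = leer_archivo(archivo_documentos)
    matriz = []
    print("\nVECTORIZANDO DOCUMENTOS...")
    for linea in lineas:
        # Start each row with a zero count for every vocabulary term.
        frecuencias = [0] * len(vocabulario)
        palabras = linea.strip().split()
        for palabra in palabras:
            # Words outside the reduced vocabulary are ignored.
            if palabra in vocabulario:
                indice = vocabulario[palabra]
                frecuencias[indice] += 1
        matriz.append(frecuencias)
    return matriz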

# Input and output files
archivo_vocabulario = 'vocabularioReducidoT.txt'
archivo_documentos  = 'documento-TRUNCADO.txt'
archivo_salida      = 'matriz-docs.txt'

# Build the vocabulary
vocabulario = generar_vocabulario(archivo_vocabulario)

# Vectorize the documents
matriz = vectorizar_documentos(archivo_documentos, vocabulario)

# Write the matrix to the output file
escribir_matriz(archivo_salida, matriz)
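The counting step can be tried without any input files. The following is a minimal sketch, assuming the vocabulary is a dictionary that maps each term to its column index, as the lookup vocabulario[palabra] above suggests. The names construir_vocabulario and vectorizar_lineas, and the sample terms, are hypothetical and not part of the project's scripts.

```python
def construir_vocabulario(terminos):
    """Map each term to its column index (hypothetical stand-in for generar_vocabulario)."""
    return {termino: indice for indice, termino in enumerate(terminos)}


def vectorizar_lineas(lineas, vocabulario):
    """Count vocabulary terms per line; same logic as vectorizar_documentos, minus file I/O."""
    matriz = []
    for linea in lineas:
        frecuencias = [0] * len(vocabulario)
        for palabra in linea.split():
            indice = vocabulario.get(palabra)
            if indice is not None:  # skip words outside the vocabulary
                frecuencias[indice] += 1
        matriz.append(frecuencias)
    return matriz


# Illustrative data only: three vocabulary terms and three "documents".
vocabulario = construir_vocabulario(["gato", "perro", "casa"])
lineas = ["gato perro gato", "casa azul", ""]
matriz = vectorizar_lineas(lineas, vocabulario)
print(matriz)  # [[2, 1, 0], [0, 0, 1], [0, 0, 0]]
```

An empty line produces an all-zero row, and a word such as "azul" that is missing from the vocabulary is simply not counted.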

2. Calculating IDF (lab5-query.py)

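This snippet calculates the IDF value of each vocabulary term over the query texts as ln(N / df) + 1, where N is the number of queries and df is the number of queries that contain the term.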

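import math

def calcular_idf(matriz):
    """Return one IDF value per vocabulary term: ln(N / df) + 1."""
    if not matriz:
        return []
    num_documentos = len(matriz)
    num_palabras = len(matriz[0])
    idf = [0.0] * num_palabras
    print("\nCALCULANDO IDF...")
    for i in range(num_palabras):
        num_documentos_con_palabra = sum(1 for documento in matriz if documento[i] > 0)
        # Assumed convention: a term absent from every row keeps IDF 0, which avoids dividing by zero.
        if num_documentos_con_palabra > 0:
            idf[i] = math.log(num_documentos / num_documentos_con_palabra) + 1
    return idf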

# Input and output files
archivo_vocabulario = 'vocabularioReducidoT.txt'
archivo_documentos  = 'cueri-TRUNCADO.txt'
archivo_salida      = 'matriz-query.txt'

# Build the vocabulary
vocabulario = generar_vocabulario(archivo_vocabulario)

# Vectorize the queries
matriz = vectorizar_documentos(archivo_documentos, vocabulario)

# Compute the IDF values
idf = calcular_idf(matriz)

# Write the IDF values to the output file
archivo_idf = 'IDF-query.txt'
escribir_vocabulario_idf(archivo_idf, vocabulario, idf)
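The formula that calcular_idf applies can be written out as follows, together with a small worked example. The values (N = 4 rows) are illustrative only and do not come from the project's data.

```latex
% N: number of rows in the matrix (queries or documents).
% df(t): number of rows in which term t occurs at least once.
\[
\mathrm{idf}(t) = \ln\!\left(\frac{N}{\mathrm{df}(t)}\right) + 1
\]
% Illustrative values with N = 4:
\[
\mathrm{df}(t) = 4 \;\Rightarrow\; \mathrm{idf}(t) = \ln 1 + 1 = 1,
\qquad
\mathrm{df}(t) = 1 \;\Rightarrow\; \mathrm{idf}(t) = \ln 4 + 1 \approx 2.386
\]
```

The added 1 means that a term present in every row still gets a non-zero weight, so it is down-weighted rather than discarded.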

Official Documentation Summary

The official documentation provided in "Laboratorio 5 Representación tf y tf-idf.pdf" outlines the following key points:

Objectives

Represent documents and queries using TF and TF-IDF models.
Understand the importance of term frequency and inverse document frequency in information retrieval.
Develop Python scripts to compute TF and TF-IDF matrices.

Methodology

Vectorization: Represent each document and query as a numerical vector based on the term frequencies.
TF Calculation: Compute the term frequency (TF) for each term in the documents and queries.
IDF Calculation: Compute the inverse document frequency (IDF) for each term.
TF-IDF Calculation: Multiply the TF values by the IDF values to obtain the TF-IDF matrix.
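
The last step above, multiplying TF by IDF, might look like the sketch below. It is not taken from the project's scripts: the function name calcular_tfidf and the sample values are hypothetical, and the IDF vector is assumed to follow the ln(N / df) + 1 formula used in calcular_idf.

```python
import math


def calcular_tfidf(matriz_tf, idf):
    """Multiply each TF row element-wise by the IDF vector (hypothetical helper)."""
    return [[tf * peso for tf, peso in zip(fila, idf)] for fila in matriz_tf]


# Illustrative TF matrix: 2 texts, 3 vocabulary terms.
matriz_tf = [
    [2, 0, 1],
    [0, 1, 1],
]

# The first two terms each occur in 1 of 2 texts; the third occurs in both.
idf = [
    math.log(2 / 1) + 1,
    math.log(2 / 1) + 1,
    math.log(2 / 2) + 1,
]

matriz_tfidf = calcular_tfidf(matriz_tf, idf)
for fila in matriz_tfidf:
    print(fila)
```

A term with a TF of 0 keeps a weight of 0 whatever its IDF, so the TF-IDF matrix is zero in the same positions as the TF matrix.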

Results and Discussion

The TF matrices for documents and queries show the frequency of terms in each text.
The IDF values measure how rare each term is across the corpus: terms that appear in few texts receive higher values than terms that appear in most of them.
The TF-IDF matrices combine the term frequency and inverse document frequency to emphasize terms that are important in specific documents but not common across the entire corpus.

Conclusion

The project demonstrates how to build TF and TF-IDF matrices for documents and queries, giving a numerical representation of text that information retrieval tasks can work with.

Installation and Usage

Clone the repository to your local machine.
Ensure you have Python installed.
Run the lab5-docs.py script to create the TF and TF-IDF matrices for documents.
Run the lab5-query.py script to create the TF and TF-IDF matrices for queries.

git clone https://github.com/KPlanisphere/tf-idf-representation.git
cd tf-idf-representation
python lab5-docs.py
python lab5-query.py

Dependencies

Python
NLTK library
Colorama library
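
Before running the scripts, it can help to confirm that the listed libraries are importable. The check below is a small sketch, assuming the libraries are imported under the module names nltk and colorama; the function name dependencias_faltantes is hypothetical.

```python
import importlib.util


def dependencias_faltantes(modulos=("nltk", "colorama")):
    """Return the module names that cannot be found in the current environment."""
    return [m for m in modulos if importlib.util.find_spec(m) is None]


faltantes = dependencias_faltantes()
if faltantes:
    print("Missing packages:", ", ".join(faltantes))
else:
    print("All dependencies are available.")
```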
