- Yves Gaetan Nana Teukam
- Caterina Alfano
- Meher Kavya Koppisetti
The goal of this assignment is to create a search engine for movies.
To make this possible downloaded 30000 wikipedia paged about movies and then retrieved the most important informations to create an inverted index to
compute our queries. The query's result are then ordered by similarity to the query (using the Cosine Similarity and the tdidf methods).
Laslty we also added our own ranking method to order the results of the query.
The Homework also includes an algorithmic question: how to find the length of the longest palindrome substring in a give string
Our Repository contains the following files:
README.md: a Markdown file that explains the content of your repository.collector.py: a python file that contains the line of code needed to collect data from thehtmlpage and Wikipedia.collector_utils.py: a python file that stores the function used incollector.py.parser.py: a python file that contains the line of code needed to parse the entire collection ofhtmlpages and save those intsvfiles.parser_utils.py: a python file that gathers the function used inparser.py.index.py: a python file that once executed generate the indexes of the Search engines.index_utils.py: a python file that contains the functions used for creating indexes.utils.py: a python file that gather functions needed in more than one of the previous files.main.py: a python file that once executed builds up the search engine.exercise_4.py: python file that contains the implementation of the algorithm that solves problem 4.main.ipynb: a Jupyter notebook explaines the strategies you adopted solving the homework and the Bonus point (visualization task). The notebook must be clear, complete and tidy. Here an example of a nice notebook from last year. Avoid pushing on GitHub notebook that contain entire long printed list, otherwise we will not be able to open it.