|
4 | 4 | Quick Start |
5 | 5 | =========== |
6 | 6 |
|
| 7 | +.. facet:: |
| 8 | + :name: genre |
| 9 | + :values: reference |
| 10 | + |
| 11 | +.. meta:: |
| 12 | + :keywords: tutorial, introduction, setup, begin |
| 13 | + |
7 | 14 | This tutorial is intended as an introduction to working with |
8 | | -**PyMongoArrow**. The reader is assumed to be familiar with basic |
9 | | -`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and |
10 | | -`MongoDB <https://docs.mongodb.com>`_ concepts. |
| 15 | +**{+driver-short+}**. It assumes that you are familiar with basic
| 16 | +`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`__ and |
| 17 | +`MongoDB <https://docs.mongodb.com>`__ concepts. |
11 | 18 |
|
12 | 19 | Prerequisites |
13 | 20 | ------------- |
14 | | -Before we start, make sure that you have the **PyMongoArrow** distribution |
15 | | -:doc:`installed <installation>`. In the Python shell, the following should |
16 | | -run without raising an exception:: |
17 | 21 |
|
18 | | - import pymongoarrow as pma |
| 22 | +Ensure that you have the {+driver-short+} distribution |
| 23 | +:ref:`installed <pymongo-arrow-install>`. In the Python shell, the following should |
| 24 | +run without raising an exception: |
| 25 | + |
| 26 | +.. code-block:: python |
| 27 | + |
| 28 | + >>> import pymongoarrow as pma |
19 | 29 |
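| | +If the import succeeds, you can also print the installed version. The
| | +following is a sketch; the conventional ``__version__`` attribute is assumed
| | +here:
| | +
| | +.. code-block:: python
| | +
| | +   >>> import pymongoarrow as pma
| | +   >>> print(pma.__version__)  # conventional version attribute; assumed here
| | +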
|
20 | 30 | This tutorial also assumes that a MongoDB instance is running on the |
21 | | -default host and port. Assuming you have `downloaded and installed |
22 | | -<https://docs.mongodb.com/manual/installation/>`_ MongoDB, you can start |
23 | | -it like so: |
| 31 | +default host and port. After you have `downloaded and installed |
| 32 | +<https://docs.mongodb.com/manual/installation/>`__ MongoDB, you can start |
| 33 | +it as shown in the following code example: |
24 | 34 |
|
25 | 35 | .. code-block:: bash |
26 | 36 |
|
27 | 37 | $ mongod |
28 | 38 |
|
29 | 39 | Extending PyMongo |
30 | 40 | ~~~~~~~~~~~~~~~~~ |
31 | | -The :mod:`pymongoarrow.monkey` module provides an interface to patch PyMongo, |
32 | | -in place, and add **PyMongoArrow**'s functionality directly to |
33 | | -:class:`~pymongo.collection.Collection` instances:: |
34 | 41 |
|
35 | | - from pymongoarrow.monkey import patch_all |
36 | | - patch_all() |
| 42 | +The ``pymongoarrow.monkey`` module provides an interface to patch PyMongo
| 43 | +in place and add {+driver-short+} functionality directly to
| 44 | +``Collection`` instances:
| 45 | + |
| 46 | +.. code-block:: python |
37 | 47 |
|
38 | | -After running :meth:`~pymongoarrow.monkey.patch_all`, new instances of |
39 | | -:class:`~pymongo.collection.Collection` will have PyMongoArrow's APIs, |
40 | | -e.g. :meth:`~pymongoarrow.api.find_pandas_all`. |
| 48 | + from pymongoarrow.monkey import patch_all |
| 49 | + patch_all() |
41 | 50 |
|
42 | | -.. note:: Users can also directly use any of **PyMongoArrow**'s APIs |
43 | | - by importing them from :mod:`pymongoarrow.api`. The only difference in |
44 | | - usage would be the need to manually pass the instance of |
45 | | - :class:`~pymongo.collection.Collection` on which the operation is to be |
46 | | - run as the first argument when directly using the API method. |
| 51 | +After you run the ``monkey.patch_all()`` function, new instances of
| 52 | +the ``Collection`` class contain the {+driver-short+} APIs,
| 53 | +such as the ``pymongoarrow.api.find_pandas_all()`` method.
| 54 | + |
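| | +For example, after patching, you can call the ``find_pandas_all()`` method
| | +directly on a collection. The following is a minimal sketch; the database and
| | +collection names are placeholders:
| | +
| | +.. code-block:: python
| | +
| | +   from pymongo import MongoClient
| | +   from pymongoarrow.monkey import patch_all
| | +
| | +   patch_all()
| | +   # Collections obtained after patch_all() expose the PyMongoArrow APIs
| | +   coll = MongoClient().db.data
| | +   df = coll.find_pandas_all({})  # load every document as a DataFrame
| | +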
| 55 | +.. note:: |
| 56 | + |
| 57 | +   You can also use any of the {+driver-short+} APIs
| 58 | +   by importing them from the ``pymongoarrow.api`` module. If you do,
| 59 | +   you must pass the ``Collection`` instance on which to run the operation
| 60 | +   as the first argument of each API method.
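| | +
| | +   For example, the following sketch calls ``find_pandas_all()`` as a plain
| | +   function imported from ``pymongoarrow.api``, passing the collection
| | +   explicitly; the database and collection names are placeholders:
| | +
| | +   .. code-block:: python
| | +
| | +      from pymongo import MongoClient
| | +      from pymongoarrow.api import find_pandas_all
| | +
| | +      coll = MongoClient().db.data
| | +      # Pass the Collection instance as the first argument
| | +      df = find_pandas_all(coll, {})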
47 | 61 |
|
48 | 62 | Test Data |
49 | 63 | ~~~~~~~~~ |
50 | | -Before we begin, we must first add some data to our cluster that we can |
51 | | -query. We can do so using **PyMongo**:: |
52 | | - |
53 | | - from datetime import datetime |
54 | | - from pymongo import MongoClient |
55 | | - client = MongoClient() |
56 | | - client.db.data.insert_many([ |
57 | | - {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']}, |
58 | | - {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']}, |
59 | | - {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']}, |
60 | | - {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}]) |
| 64 | + |
| 65 | +The following code uses PyMongo to add sample data to your cluster: |
| 66 | + |
| 67 | +.. code-block:: python |
| 68 | + |
| 69 | + from datetime import datetime |
| 70 | + from pymongo import MongoClient |
| 71 | + client = MongoClient() |
| 72 | + client.db.data.insert_many([ |
| 73 | + {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']}, |
| 74 | + {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']}, |
| 75 | + {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']}, |
| 76 | + {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}]) |
61 | 77 |
|
62 | 78 | Defining the Schema |
63 | 79 | ------------------- |
64 | | -**PyMongoArrow** relies upon a data schema to marshall |
65 | | -query result sets into tabular form. This schema can either be automatically inferred from the data, |
66 | | -or provided by the user. Users can define the schema by |
67 | | -instantiating :class:`pymongoarrow.api.Schema` using a mapping of field names |
68 | | -to type-specifiers, e.g.:: |
69 | 80 |
|
70 | | - from pymongoarrow.api import Schema |
71 | | - schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime}) |
| 81 | +{+driver-short+} relies on a data schema to marshal
| 82 | +query result sets into tabular form. If you don't provide this schema, {+driver-short+}
| 83 | +infers one from the data. You can define the schema by
| 84 | +creating a ``Schema`` object that maps field names
| 85 | +to type-specifiers, as shown in the following example:
| 86 | + |
| 87 | +.. code-block:: python |
| 88 | + |
| | +   from datetime import datetime
| 89 | +   from pymongoarrow.api import Schema
| 90 | +   schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
72 | 91 |
|
| 92 | +MongoDB uses embedded documents to represent nested data. {+driver-short+} offers |
| 93 | +first-class support for these documents: |
73 | 94 |
|
74 | | -PyMongoArrow offers first-class support for Nested data (embedded documents):: |
| 95 | +.. code-block:: python |
75 | 96 |
|
76 | | - schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}}) |
| 97 | + schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}}) |
77 | 98 |
|
78 | | -Lists (and nested lists) are also supported:: |
| 99 | +{+driver-short+} also supports lists and nested lists: |
79 | 100 |
|
80 | | - from pyarrow import list_, string |
81 | | - schema = Schema({'txns': list_(string())}) |
82 | | - polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
| 101 | +.. code-block:: python |
83 | 102 |
|
84 | | -There are multiple permissible type-identifiers for each supported BSON type. |
85 | | -For a full-list of data types and associated type-identifiers see |
86 | | -:doc:`data_types`. |
| 103 | + from pyarrow import list_, string |
| 104 | + schema = Schema({'txns': list_(string())}) |
| 105 | + polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
87 | 106 |
|
| 107 | +.. tip:: |
| 108 | + |
| 109 | + {+driver-short+} includes multiple permissible type-identifiers for each supported BSON |
| 110 | + type. For a full list of these data types and their associated type-identifiers, see |
| 111 | +   :ref:`pymongo-arrow-data-types`.
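| | +
| | +   For example, Python's ``int`` and PyArrow's ``int64()`` identify the same
| | +   64-bit integer type, so the following two schemas should be equivalent (a
| | +   sketch, assuming the standard PyArrow type factory):
| | +
| | +   .. code-block:: python
| | +
| | +      from pyarrow import int64
| | +      from pymongoarrow.api import Schema
| | +
| | +      schema1 = Schema({'_id': int})      # Python type-identifier
| | +      schema2 = Schema({'_id': int64()})  # equivalent PyArrow type-identifier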
88 | 112 |
|
89 | 113 | Find Operations |
90 | 114 | --------------- |
91 | | -We are now ready to query our data. Let's start by running a ``find`` |
92 | | -operation to load all records with a non-zero ``amount`` as a |
93 | | -:class:`pandas.DataFrame`:: |
94 | 115 |
|
95 | | - df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema) |
| 116 | +The following code example shows how to load all records that have a non-zero |
| 117 | +value for the ``amount`` field as a ``pandas.DataFrame`` object: |
| 118 | + |
| 119 | +.. code-block:: python |
| 120 | + |
| 121 | + df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema) |
| 122 | + |
| 123 | +You can also load the same result set as a ``pyarrow.Table`` instance: |
96 | 124 |
|
97 | | -We can also load the same result set as a :class:`pyarrow.Table` instance:: |
| 125 | +.. code-block:: python |
98 | 126 |
|
99 | | - arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema) |
| 127 | + arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema) |
100 | 128 |
|
101 | | -a :class:`polars.DataFrame`:: |
| 129 | +Or as a ``polars.DataFrame`` instance: |
102 | 130 |
|
103 | | - df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
| 131 | +.. code-block:: python |
104 | 132 |
|
105 | | -or as **Numpy arrays**:: |
| 133 | + df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
106 | 134 |
|
107 | | - ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema) |
| 135 | +Or as NumPy arrays:
108 | 136 |
|
109 | | -In the NumPy case, the return value is a dictionary where the keys are field |
110 | | -names and values are corresponding :class:`numpy.ndarray` instances. |
| 137 | +.. code-block:: python |
| 138 | + |
| 139 | + ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema) |
| 140 | + |
| 141 | +When you use the NumPy APIs, the return value is a dictionary where the keys are field
| 142 | +names and the values are the corresponding ``numpy.ndarray`` instances.
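| | +
| | +For example, you can index the dictionary by field name and then use ordinary
| | +NumPy operations (this sketch assumes the ``amount`` field from the preceding
| | +query):
| | +
| | +.. code-block:: python
| | +
| | +   amounts = ndarrays['amount']  # numpy.ndarray of amount values
| | +   print(amounts.sum())          # aggregate with ordinary NumPy operations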
111 | 143 |
|
112 | 144 | .. note:: |
113 | 145 |
|
114 | | - For all of the examples above, the schema can be omitted like so:: |
| 146 | +   In all of the preceding examples, you can omit the schema, as shown in the
| 147 | +   following example:
115 | 148 |
|
116 | | - arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}) |
| 149 | + .. code-block:: python |
117 | 150 |
|
118 | | - In this case, PyMongoArrow will try to automatically apply a schema based on |
119 | | - the data contained in the first batch. |
| 151 | + arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}) |
120 | 152 |
|
| 153 | + If you omit the schema, {+driver-short+} tries to automatically apply a schema based on |
| 154 | + the data contained in the first batch. |
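| | +
| | +   To see which schema was applied, you can inspect the result's metadata;
| | +   for example, a ``pyarrow.Table`` exposes it through its ``schema``
| | +   attribute:
| | +
| | +   .. code-block:: python
| | +
| | +      print(arrow_table.schema)  # field names and inferred Arrow types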
121 | 155 |
|
122 | 156 | Aggregate Operations |
123 | 157 | -------------------- |
124 | | -Running an ``aggregate`` operation is similar to ``find``, but it takes a sequence of operations to perform. |
125 | | -Here is a simple example of ``aggregate_pandas_all`` that outputs a new dataframe |
126 | | -in which all ``_id`` values are grouped together and their ``amount`` values summed:: |
127 | 158 |
|
128 | | - df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}]) |
| 159 | +Running an aggregate operation is similar to running a find operation, but it takes a
| 160 | +sequence of aggregation stages to perform.
| 161 | + |
| 162 | +The following is a simple example of the ``aggregate_pandas_all()`` method. It outputs a
| 163 | +new dataframe in which all documents are grouped together and their ``amount`` values
| 164 | +are summed:
| 165 | + |
| 166 | +.. code-block:: python |
| 167 | + |
| 168 | + df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}]) |
| 169 | + |
| 170 | +You can also run aggregate operations on embedded documents. |
| 171 | +The following example unwinds values in the nested ``txns`` field, counts the
| 172 | +occurrences of each value, sorts the results by count in descending order, and
| 173 | +returns them as a dictionary of NumPy ``ndarray`` objects:
129 | 174 |
|
130 | | -Nested data (embedded documents) are also supported. |
131 | | -In this more complex example, we unwind values in the nested ``txn`` field, count the number of each, |
132 | | -then return as a list of numpy ndarrays sorted in decreasing order:: |
| 175 | +.. code-block:: python |
133 | 176 |
|
134 | | - pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}] |
135 | | - ndarrays = client.db.data.aggregate_numpy_all(pipeline) |
| 177 | + pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}] |
| 178 | + ndarrays = client.db.data.aggregate_numpy_all(pipeline) |
136 | 179 |
|
137 | | -More information on aggregation pipelines can be found `here <https://www.mongodb.com/docs/manual/core/aggregation-pipeline/>`_. |
| 180 | +.. tip:: |
| 181 | + |
| 182 | + For more information about aggregation pipelines, see the |
| 183 | + :manual:`MongoDB Server documentation </core/aggregation-pipeline/>`. |
138 | 184 |
|
139 | 185 | Writing to MongoDB |
140 | 186 | ------------------ |
141 | | -All of these types, Arrow's :class:`~pyarrow.Table`, Pandas' |
142 | | -:class:`~pandas.DataFrame`, NumPy's :class:`~numpy.ndarray`, or :class:`~polars.DataFrame` can |
143 | | -be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function:: |
144 | 187 |
|
145 | | - from pymongoarrow.api import write |
146 | | - from pymongo import MongoClient |
147 | | - coll = MongoClient().db.my_collection |
148 | | - write(coll, df) |
149 | | - write(coll, arrow_table) |
150 | | - write(coll, ndarrays) |
| 188 | +You can use the ``pymongoarrow.api.write()`` function to write objects of the following types to MongoDB:
151 | 189 |
|
152 | | -(Keep in mind that NumPy arrays are specified as ``dict[str, ndarray]``.) |
| 190 | +- Arrow ``Table`` |
| 191 | +- Pandas ``DataFrame`` |
| 192 | +- NumPy ``ndarray`` |
| 193 | +- Polars ``DataFrame`` |
| 194 | + |
| 195 | +.. code-block:: python |
| 196 | + |
| 197 | + from pymongoarrow.api import write |
| 198 | + from pymongo import MongoClient |
| 199 | + coll = MongoClient().db.my_collection |
| 200 | + write(coll, df) |
| 201 | + write(coll, arrow_table) |
| 202 | + write(coll, ndarrays) |
| 203 | + |
| 204 | +.. note:: |
| 205 | + |
| 206 | + NumPy arrays are specified as ``dict[str, ndarray]``. |
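| | +
| | +   For example, the following sketch writes two fields (the names and values
| | +   are placeholders); all arrays must have the same length:
| | +
| | +   .. code-block:: python
| | +
| | +      import numpy as np
| | +      write(coll, {'amount': np.array([1.5, 2.5]),
| | +                   'account_number': np.array([5, 6])})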
153 | 207 |
|
154 | 208 | Writing to Other Formats |
155 | 209 | ------------------------ |
156 | | -Once result sets have been loaded, one can then write them to any format that the package supports. |
157 | 210 |
|
158 | | -For example, to write the table referenced by the variable ``arrow_table`` to a Parquet |
159 | | -file ``example.parquet``, run:: |
160 | | - |
161 | | - import pyarrow.parquet as pq |
162 | | - pq.write_table(arrow_table, 'example.parquet') |
| 211 | +Once result sets have been loaded, you can write them to any format that the package
| 212 | +supports.
163 | 213 |
|
164 | | -Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety |
165 | | -of formats including CSV, and HDF. To write the data frame |
166 | | -referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run:: |
| 214 | +For example, to write the table referenced by the variable ``arrow_table`` to a Parquet |
| 215 | +file named ``example.parquet``, run the following code: |
167 | 216 |
|
168 | | - df.to_csv('out.csv', index=False) |
| 217 | +.. code-block:: python |
| 218 | + |
| 219 | + import pyarrow.parquet as pq |
| 220 | + pq.write_table(arrow_table, 'example.parquet') |
169 | 221 |
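| | +To verify the file, you can read it back with ``pyarrow.parquet.read_table()``:
| | +
| | +.. code-block:: python
| | +
| | +   table = pq.read_table('example.parquet')
| | +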
|
170 | | -The Polars API is a mix of the two:: |
| 222 | +Pandas also supports writing ``DataFrame`` instances to a variety |
| 223 | +of formats, including CSV and HDF. To write the data frame |
| 224 | +referenced by the variable ``df`` to a CSV file named ``out.csv``, run the following |
| 225 | +code: |
171 | 226 |
|
| 227 | +.. code-block:: python |
| 228 | + |
| 229 | + df.to_csv('out.csv', index=False) |
172 | 230 |
|
173 | | - import polars as pl |
174 | | - df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) |
175 | | - df.write_parquet('example.parquet') |
| 231 | +The Polars API combines elements of the two preceding examples:
176 | 232 |
|
| 233 | +.. code-block:: python |
| 234 | + |
| 235 | + import polars as pl |
| 236 | + df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) |
| 237 | + df.write_parquet('example.parquet') |
177 | 238 |
|
178 | 239 | .. note:: |
179 | 240 |
|
180 | | - Nested data is supported for parquet read/write but is not well supported |
181 | | - by Arrow or Pandas for CSV read/write. |
| 241 | +   Nested data is supported for Parquet read and write operations, but is not well
| 242 | +   supported by Arrow or Pandas for CSV read and write operations.